Abstract: The problem of learning tree-structured Gaussian 
graphical models from independent and identically distributed 
(i.i.d.) samples is considered. The influence of the tree structure 
and the parameters of the Gaussian distribution on the learning 
rate as the number of samples increases is discussed. Specifically, 
the error exponent corresponding to the event that the estimated 
tree structure differs from the actual unknown tree structure of 
the distribution is analyzed. Finding the error exponent reduces 
to a least-squares problem in the very noisy learning regime. 
In this regime, it is shown that the extremal tree structure that 
minimizes the error exponent is the star for any fixed set of 
correlation coefficients on the edges of the tree. If the magnitudes 
of all the correlation coefficients are less than 0.63, it is also shown 
that the tree structure that maximizes the error exponent is the 
Markov chain. In other words, the star and the chain graphs 
represent the hardest and the easiest structures to learn in the 
class of tree-structured Gaussian graphical models. This result 
can also be intuitively explained by correlation decay: pairs of 
nodes which are far apart, in terms of graph distance, are unlikely 
to be mistaken as edges by the maximum-likelihood estimator in 
the asymptotic regime. 

Index Terms: Structure learning, Gaussian graphical models, 
Gauss-Markov random fields, large deviations, error exponents, 
tree distributions, Euclidean information theory. 



I. Introduction 

Learning of structure and interdependencies of a large 
collection of random variables from a set of data samples 
is an important task in signal and image analysis and many 
other scientific domains (see examples in [1]-[4] and references 
therein). This task is extremely challenging when the 
dimensionality of the data is large compared to the number 
of samples. Furthermore, structure learning of multivariate 
distributions is also complicated as it is imperative to find 
the right balance between data fidelity and overfitting the data 
to the model. This problem is circumvented when we limit the 
distributions to the set of Markov tree distributions, which have 
a fixed number of parameters and are tractable for learning [3] 
and statistical inference [1], [4]. 

The problem of maximum-likelihood (ML) learning of a 
Markov tree distribution from i.i.d. samples has an elegant 
solution, proposed by Chow and Liu in [5]. The ML tree 
structure is given by the maximum-weight spanning tree 
(MWST) with empirical mutual information quantities as the 
edge weights. Furthermore, the ML algorithm is consistent [6], 
which implies that the error probability in learning the tree 
structure decays to zero with the number of samples available 
for learning. 

While consistency is an important qualitative property, there 
is substantial motivation for additional and more quantitative 
characterization of performance. One such measure, which we 
investigate in this theoretical paper, is the rate of decay of the 
error probability, i.e., the probability that the ML estimate of 
the edge set differs from the true edge set. When the error 
probability decays exponentially, the learning rate is usually 
referred to as the error exponent, which provides a careful 
measure of performance of the learning algorithm since a 
larger rate implies a faster decay of the error probability. 

We answer three fundamental questions in this paper: (i) 
Can we characterize the error exponent for structure learning 
by the ML algorithm for tree-structured Gaussian graphical 
models (also called Gauss-Markov random fields)? (ii) How 
do the structure and parameters of the model influence the 
error exponent? (iii) What are extremal tree distributions for 
learning, i.e., the distributions that maximize and minimize 
the error exponents? We believe that our intuitively appealing 
answers to these important questions provide key insights 
for learning tree-structured Gaussian graphical models from 
data, and thus, for modeling high-dimensional data using 
parameterized tree-structured distributions. 

A. Summary of Main Results 

We derive the error exponent as the optimal value of the 
objective function of a non-convex optimization problem, 
which can only be solved numerically (Theorem 2). To gain 
better insights into when errors occur, we approximate the 
error exponent with a closed-form expression that can be 
interpreted as the signal-to-noise ratio (SNR) for structure 
learning (Theorem 4), thus showing how the parameters of 
the true model affect learning. Furthermore, we show that due 
to correlation decay, pairs of nodes which are far apart, in 
terms of their graph distance, are unlikely to be mistaken as 
edges by the ML estimator. This is not only an intuitive result, 
but also results in a significant reduction in the computational 
complexity to find the exponent: from $O(d^{d-2})$ for exhaustive 
search and $O(d^3)$ for discrete tree models [7] to $O(d)$ for 
Gaussians (Proposition 7), where $d$ is the number of nodes. 

We then analyze extremal tree structures for learning, given 
a fixed set of correlation coefficients on the edges of the tree. 
Our main result is the following: The star graph minimizes the 
error exponent and if the absolute value of all the correlation 
coefficients of the variables along the edges is less than 0.63, 
then the Markov chain maximizes the error exponent 
(Theorem 8). Therefore, the extremal tree structures in terms 
of the diameter are also extremal trees for learning Gaussian 
tree distributions. This agrees with the intuition that the 
amount of correlation decay increases with the tree diameter, 
and that correlation decay helps the ML estimator to better 
distinguish the edges from the non-neighbor pairs. Lastly, 
we analyze how changing the size of the tree influences the 
magnitude of the error exponent (Propositions 11 and 12). 

B. Related Work 

There is a substantial body of work on approximate learning 
of graphical models (also known as Markov random fields) 
from data, e.g., [8]-[11]. The authors of these papers use 
various score-based approaches [8], the maximum entropy 
principle [9] or $\ell_1$ regularization [10], [11] as approximate 
structure learning techniques. Consistency guarantees in terms 
of the number of samples, the number of variables and 
the maximum neighborhood size are provided. Information-theoretic 
limits [12] for learning graphical models have also 
been derived. In [13], bounds on the error rate for learning the 
structure of Bayesian networks were provided but in contrast 
to our work, these bounds are not asymptotically tight (cf. 
Theorem 2). Furthermore, the analysis in [13] is tied to the 
Bayesian Information Criterion. The focus of our paper is the 
analysis of the Chow-Liu [5] algorithm as an exact learning 
technique for estimating the tree structure and comparing 
error rates amongst different graphical models. In a recent 
paper [14], the authors concluded that if the graphical model 
possesses long range correlations, then it is difficult to learn. 
In this paper, we in fact identify the extremal structures and 
distributions in terms of error exponents for structure learning. 
The area of study in statistics known as covariance selection 
[15], [16] also has connections with structure learning 
in Gaussian graphical models. Covariance selection involves 
estimating the non-zero elements in the inverse covariance 
matrix and providing consistency guarantees of the estimate 
in some norm, e.g., the Frobenius norm in [17]. 

We previously analyzed the error exponent for learning 
discrete tree distributions in [7]. We proved that for every 
discrete spanning tree model, the error exponent for learning is 
strictly positive, which implies that the error probability decays 
exponentially fast. In this paper, we extend these results to 
Gaussian tree models and derive new results which are both 
explicit and intuitive by exploiting the properties of Gaussians. 
The results we obtain in Sections III and IV are analogous to 
the results in [7] obtained for discrete distributions, although 
the proof techniques are different. Sections V and VI contain 
new results thanks to simplifications which hold for Gaussians 
but which do not hold for discrete distributions. 

C. Paper Outline 

This paper is organized as follows: In Section II, we state 
the problem precisely and provide necessary preliminaries on 
learning Gaussian tree models. In Section III, we derive an 
expression for the so-called crossover rate of two pairs of 
nodes. We then relate the set of crossover rates to the error 
exponent for learning the tree structure. In Section IV, we 
leverage ideas from Euclidean information theory [18] 
to state conditions that allow accurate approximations of the 
error exponent. We demonstrate in Section V how to reduce 
the computational complexity for calculating the exponent. In 
Section VI, we identify extremal structures that maximize and 
minimize the error exponent. Numerical results are presented 
in Section VII and we conclude the discussion in Section VIII. 

II. Preliminaries and Problem Statement 
A. Basics of Undirected Gaussian Graphical Models 

Undirected graphical models or Markov random fields¹ 
(MRFs) are probability distributions that factorize according 
to given undirected graphs [3]. In this paper, we focus solely 
on spanning trees (i.e., undirected, acyclic, connected graphs). 
A $d$-dimensional random vector $\mathbf{x} = [x_1, \ldots, x_d]^T \in \mathbb{R}^d$ is 
said to be Markov on a spanning tree $T_p = (V, \mathcal{E}_p)$ with 
vertex (or node) set $V = \{1, \ldots, d\}$ and edge set $\mathcal{E}_p \subset \binom{V}{2}$ 
if its distribution $p(\mathbf{x})$ satisfies the (local) Markov property: 
$p(x_i \mid x_{V \setminus \{i\}}) = p(x_i \mid x_{\mathrm{nbd}(i)})$, where $\mathrm{nbd}(i) := \{j \in V : (i,j) \in \mathcal{E}_p\}$ 
denotes the set of neighbors of node $i$. We 
also denote the set of spanning trees with $d$ nodes as $\mathcal{T}^d$, 
thus $T_p \in \mathcal{T}^d$. Since $p$ is Markov on the tree $T_p$, its 
probability density function (pdf) factorizes according to $T_p$ 
into node marginals $\{p_i : i \in V\}$ and pairwise marginals 
$\{p_{i,j} : (i,j) \in \mathcal{E}_p\}$ in the following specific way [3] given the 
edge set $\mathcal{E}_p$: 



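$$p(\mathbf{x}) = \prod_{i \in V} p_i(x_i) \prod_{(i,j) \in \mathcal{E}_p} \frac{p_{i,j}(x_i, x_j)}{p_i(x_i)\, p_j(x_j)}. \tag{1}$$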



We assume that $p$, in addition to being Markov on the spanning 
tree $T_p = (V, \mathcal{E}_p)$, is a Gaussian graphical model or Gauss-Markov 
random field (GMRF) with known zero mean² and 
unknown positive definite covariance matrix $\Sigma \succ 0$. Thus, 
$p(\mathbf{x})$ can be written as 

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \mathbf{x}^T \Sigma^{-1} \mathbf{x} \right). \tag{2}$$

We also use the notation $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \mathbf{0}, \Sigma)$ as a shorthand 
for (2). For Gaussian graphical models, it is known that the 
fill-pattern of the inverse covariance matrix $\Sigma^{-1}$ encodes the 
structure of $p(\mathbf{x})$ [3], i.e., $\Sigma^{-1}(i,j) = 0$ if and only if (iff) 
$(i,j) \notin \mathcal{E}_p$. 
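
The fill-pattern property can be checked numerically. The following is a minimal sketch and not part of the original analysis: it assumes a unit-variance chain on nodes 0, 1, 2, 3 (0-indexed for convenience) in which the correlation of any node pair is the product of the edge correlations along the connecting path. The helper name `chain_covariance` and the correlation values are illustrative choices.

```python
import numpy as np

def chain_covariance(edge_rhos):
    """Covariance of a unit-variance Gaussian Markov chain 0 - 1 - ... - (d-1).

    edge_rhos[k] is the correlation on edge (k, k+1); the correlation of any
    other pair is taken to be the product of the edge correlations on its path.
    """
    d = len(edge_rhos) + 1
    sigma = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            sigma[i, j] = sigma[j, i] = np.prod(edge_rhos[i:j])
    return sigma

sigma = chain_covariance([0.8, 0.5, 0.9])
precision = np.linalg.inv(sigma)

# Entries of the inverse covariance for the non-edges (0,2), (0,3), (1,3)
# and for the edges (0,1), (1,2), (2,3).
non_edge_entries = [precision[0, 2], precision[0, 3], precision[1, 3]]
edge_entries = [precision[0, 1], precision[1, 2], precision[2, 3]]
print(np.round(precision, 6))
```

In this example the inverse covariance is tridiagonal: the entries indexed by non-edges vanish up to floating-point error, while those indexed by edges do not.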

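We denote the set of pdfs on $\mathbb{R}^d$ by $\mathcal{P}(\mathbb{R}^d)$, the set of 
Gaussian pdfs on $\mathbb{R}^d$ by $\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d)$ and the set of Gaussian 
graphical models which factorize according to some tree in 
$\mathcal{T}^d$ as $\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d)$. For learning the structure of $p(\mathbf{x})$ (or 
equivalently the fill-pattern of $\Sigma^{-1}$), we are provided with a 
set of $d$-dimensional samples $\mathbf{x}^n := \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ drawn from 
$p$, where $\mathbf{x}_k := [x_{k,1}, \ldots, x_{k,d}]^T \in \mathbb{R}^d$. 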

¹ In this paper, we use the terms "graphical models" and "Markov random 
fields" interchangeably. 

² Our results also extend to the scenario where the mean of the Gaussian is 
unknown and has to be estimated from the samples. 

B. ML Estimation of Gaussian Tree Models 

In this subsection, we review the Chow-Liu ML learning 
algorithm [5] for estimating the structure of $p$ given samples 
$\mathbf{x}^n$. Denoting $D(p_1 \,\|\, p_2) := \mathbb{E}_{p_1} \log(p_1/p_2)$ as the Kullback-Leibler 
(KL) divergence [19] between $p_1$ and $p_2$, the ML 
estimate of the structure $\mathcal{E}_{\mathrm{CL}}(\mathbf{x}^n)$ is given by the optimization 
problem³ 

$$\mathcal{E}_{\mathrm{CL}}(\mathbf{x}^n) := \operatorname*{argmin}_{\mathcal{E}_q :\, q \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d)} D(\hat{p} \,\|\, q), \tag{3}$$

where $\hat{p}(\mathbf{x}) := \mathcal{N}(\mathbf{x}; \mathbf{0}, \hat{\Sigma})$ and $\hat{\Sigma} := \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \mathbf{x}_k^T$ is the 
empirical covariance matrix. Given $\hat{p}$, and exploiting the fact 
that $q$ in (3) factorizes according to a tree as in (1), Chow and 
Liu [5] showed that the optimization for the optimal edge set 
in (3) can be reduced to a MWST problem: 



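$$\mathcal{E}_{\mathrm{CL}}(\mathbf{x}^n) = \operatorname*{argmax}_{\mathcal{E}_q :\, q \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d)} \sum_{e \in \mathcal{E}_q} I(\hat{p}_e), \tag{4}$$

where the edge weights are the empirical mutual information 
quantities [19] given by⁴ 

$$I(\hat{p}_e) = -\frac{1}{2} \log\left(1 - \hat{\rho}_e^2\right), \tag{5}$$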



and where the empirical correlation coefficients are given by 
$\hat{\rho}_e = \hat{\rho}_{i,j} := \hat{\Sigma}(i,j) / (\hat{\Sigma}(i,i)\hat{\Sigma}(j,j))^{1/2}$. Note that in (4), the 
estimated edge set $\mathcal{E}_{\mathrm{CL}}(\mathbf{x}^n)$ depends on $n$ and, specifically, on 
the samples in $\mathbf{x}^n$ and we make this dependence explicit. We 
assume that $T_p$ is a spanning tree because with probability 1, 
the resulting optimization problem in (4) produces a spanning 
tree as all the mutual information quantities in (5) will be non-zero. 
If $T_p$ were allowed to be a proper forest (a tree that is not 
connected), the estimation of $\mathcal{E}_p$ would be inconsistent because 
the learned edge set would differ from the true edge set. 
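
A minimal sketch of the Chow-Liu procedure for zero-mean Gaussian samples is given below. It follows (4) and (5) and uses Kruskal's algorithm as one possible MWST solver; the text does not prescribe a particular solver. The function name `chow_liu_gaussian`, the 0-indexed node labels and the synthetic four-node chain used for illustration are assumptions.

```python
import numpy as np

def chow_liu_gaussian(samples):
    """Return the Chow-Liu edge set for zero-mean Gaussian samples (n x d)."""
    n, d = samples.shape
    sigma_hat = samples.T @ samples / n          # empirical covariance (mean known to be zero)
    std = np.sqrt(np.diag(sigma_hat))
    rho_hat = sigma_hat / np.outer(std, std)     # empirical correlation coefficients
    # Empirical mutual information of every node pair, as in (5).
    weights = [(-0.5 * np.log(1.0 - rho_hat[i, j] ** 2), i, j)
               for i in range(d) for j in range(i + 1, d)]

    # Kruskal's algorithm with union-find for the maximum-weight spanning tree in (4).
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = set()
    for _, i, j in sorted(weights, reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                     # adding (i, j) does not create a cycle
            parent[root_i] = root_j
            edges.add((i, j))
    return edges

# Illustration: samples from a unit-variance chain 0 - 1 - 2 - 3.
rng = np.random.default_rng(0)
edge_rhos = [0.8, 0.6, 0.7]
n = 5000
x = np.empty((n, 4))
x[:, 0] = rng.standard_normal(n)
for k, r in enumerate(edge_rhos):
    x[:, k + 1] = r * x[:, k] + np.sqrt(1.0 - r ** 2) * rng.standard_normal(n)

estimated_edges = chow_liu_gaussian(x)
print(sorted(estimated_edges))
```

With this many samples and well-separated correlations, the estimate is expected to coincide with the true chain; the error event studied below corresponds to the cases where it does not.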

C. Problem Statement 

We now state our problem formally. Given a set of i.i.d. 
samples $\mathbf{x}^n$ drawn from an unknown Gaussian tree model $p$ 
with edge set $\mathcal{E}_p$, we define the error event that the set of edges 
is estimated incorrectly as 

$$\mathcal{A}_n := \{\mathbf{x}^n : \mathcal{E}_{\mathrm{CL}}(\mathbf{x}^n) \neq \mathcal{E}_p\}, \tag{6}$$

where $\mathcal{E}_{\mathrm{CL}}(\mathbf{x}^n)$ is the edge set of the Chow-Liu ML estimator 
in (3). In this paper, we are interested in computing and subsequently 
studying the error exponent $K_p$, or the rate at which the error 
probability of the event $\mathcal{A}_n$ with respect to the true model $p$ 
decays with the number of samples $n$. $K_p$ is defined as 

$$K_p := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\mathcal{A}_n), \tag{7}$$

assuming the limit exists and where $\mathbb{P}$ is the product probability 
measure with respect to the true model $p$. We prove that the 
limit in (7) exists in Section III (Corollary 3). The value of $K_p$ 
for different tree models $p$ provides an indication of the relative 
ease of estimating such models. Note that both the parameters 
and structure of the model influence the magnitude of $K_p$. 

³ Note that it is unnecessary to impose the Gaussianity constraint on $q$ in (3). 
We can optimize over $\mathcal{P}(\mathbb{R}^d, \mathcal{T}^d)$ instead of $\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d)$. It can be shown 
that the optimal distribution is still Gaussian. We omit the proof for brevity. 

⁴ Our notation for the mutual information between two random variables 
differs from the conventional one in [19]. 




Fig. 1. If the error event occurs during the learning process, an edge $e \in \mathrm{Path}(e'; \mathcal{E}_p)$ 
is replaced by a non-edge $e' \notin \mathcal{E}_p$ in the original model. We 
identify the crossover event that has the minimum rate and its rate is $K_p$. 



III. Deriving the Error Exponent 
A. Crossover Rates for Mutual Information Quantities 

To compute $K_p$, consider first two pairs of nodes $e, e' \in \binom{V}{2}$ 
such that $I(p_e) > I(p_{e'})$. We now derive a large-deviation 
principle (LDP) for the crossover event of empirical mutual 
information quantities 

$$\mathcal{C}_{e,e'} := \{\mathbf{x}^n : I(\hat{p}_e) \le I(\hat{p}_{e'})\}. \tag{8}$$

This is an important event for the computation of $K_p$ because 
if two pairs of nodes (or node pairs) $e$ and $e'$ happen to 
cross over, this may lead to the event $\mathcal{A}_n$ occurring (see the 
next subsection). We define $J_{e,e'} = J_{e,e'}(p_{e,e'})$, the crossover 
rate of empirical mutual information quantities, as 

$$J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}(\mathcal{C}_{e,e'}). \tag{9}$$

Here we remark that the following analysis does not depend 
on whether e and e' share a node. If e and e' do share a node, 
we say they are an adjacent pair of nodes. Otherwise, we say 
$e$ and $e'$ are disjoint. We also reserve the symbol $m$ to denote 
the total number of distinct nodes in $e$ and $e'$. Hence, $m = 3$ 
if $e$ and $e'$ are adjacent and $m = 4$ if $e$ and $e'$ are disjoint. 

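Theorem 1 (LDP for Crossover of Empirical MI): For two 
node pairs $e, e' \in \binom{V}{2}$ with pdf $p_{e,e'} \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^m)$ (for $m = 3$ 
or $m = 4$), the crossover rate for empirical mutual information 
quantities is 

$$J_{e,e'} = \inf_{q \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^m)} \left\{ D(q \,\|\, p_{e,e'}) : I(q_e) = I(q_{e'}) \right\}. \tag{10}$$

The crossover rate $J_{e,e'} > 0$ iff the correlation coefficients of 
$p_{e,e'}$ satisfy $|\rho_e| \neq |\rho_{e'}|$. 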

Proof: (Sketch) This is an application of Sanov's Theorem 
[20, Ch. 3], and the contraction principle [21, Ch. 3] in 
large deviations theory, together with the maximum entropy 
principle [19, Ch. 12]. We remark that the proof is different 
from the corresponding result in [7]. See Appendix A. ■ 

Theorem 1 says that in order to compute the crossover 
rate $J_{e,e'}$, we can restrict our attention to a problem that 
involves only an optimization over Gaussians, which is a finite-dimensional 
optimization problem. 

B. Error Exponent for Structure Learning 

We now relate the set of crossover rates $\{J_{e,e'}\}$ over all 
the node pairs $e, e'$ to the error exponent $K_p$, defined in (7). 
The primary idea behind this computation is the following: We 
consider a fixed non-edge $e' \notin \mathcal{E}_p$ in the true tree $T_p$ which 
may be erroneously selected during the learning process. Because 
of the global tree constraint, this non-edge $e'$ must replace 
some edge along its unique path in the original model. We only 
need to consider a single such crossover event because $K_p$ will 
be larger if there are multiple crossovers (see formal proof 
in [7]). Finally, we identify the crossover event that has the 
minimum rate. See Fig. 1 for an illustration of this intuition. 

Theorem 2 (Exponent as a Crossover Event [7]): The error 
exponent for structure learning of tree-structured Gaussian 
graphical models, defined in (7), is given as 

$$K_p = \min_{e' \notin \mathcal{E}_p} \; \min_{e \in \mathrm{Path}(e'; \mathcal{E}_p)} J_{e,e'}, \tag{11}$$

where $\mathrm{Path}(e'; \mathcal{E}_p) \subset \mathcal{E}_p$ is the unique path joining the nodes 
in $e'$ in the original tree $T_p = (V, \mathcal{E}_p)$. 

This theorem implies that the dominant error tree [7], which 
is the asymptotically most-likely estimated error tree under the 
error event $\mathcal{A}_n$, differs from the true tree $T_p$ in exactly one 
edge. Note that in order to compute the error exponent $K_p$ 
in (11), we need to compute at most $\mathrm{diam}(T_p)(d-1)(d-2)/2$ 
crossover rates, where $\mathrm{diam}(T_p)$ is the diameter of $T_p$. Thus, 
this is a significant reduction in the complexity of computing 
$K_p$ as compared to performing an exhaustive search over 
all possible error events which requires a total of $O(d^{d-2})$ 
computations [22] (equal to the number of spanning trees with 
$d$ nodes). 
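
The set of crossover rates that enter the double minimization in (11) can be enumerated directly. The sketch below is illustrative only (the function names `tree_path` and `crossover_candidates` are assumptions): for every non-edge $e'$ it lists the tree edges $e$ on the unique path joining the nodes of $e'$, and the number of resulting pairs can be compared with the bound $\mathrm{diam}(T_p)(d-1)(d-2)/2$.

```python
from itertools import combinations

def tree_path(edges, source, target):
    """Edges on the unique path between source and target in a tree."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    # Depth-first search from source that records each node's predecessor.
    previous = {source: None}
    stack = [source]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in previous:
                previous[nxt] = node
                stack.append(nxt)
    # Walk back from target to source, collecting the traversed edges.
    path = []
    node = target
    while previous[node] is not None:
        path.append(tuple(sorted((previous[node], node))))
        node = previous[node]
    return path

def crossover_candidates(nodes, edges):
    """All pairs (e, e') with e' a non-edge and e on Path(e'; E_p), as in (11)."""
    edge_set = {tuple(sorted(e)) for e in edges}
    pairs = []
    for e_prime in combinations(sorted(nodes), 2):
        if e_prime not in edge_set:
            pairs.extend((e, e_prime) for e in tree_path(edges, *e_prime))
    return pairs

chain = [(1, 2), (2, 3), (3, 4)]
star = [(1, 2), (1, 3), (1, 4)]
chain_pairs = crossover_candidates([1, 2, 3, 4], chain)
star_pairs = crossover_candidates([1, 2, 3, 4], star)
print(len(chain_pairs), len(star_pairs))
```

For the four-node chain (diameter 3) this lists 7 pairs against a bound of 9, and for the four-node star (diameter 2) it lists 6 pairs, which meets the bound of 6.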

In addition, from the result in Theorem 2, we can derive 
conditions to ensure that $K_p > 0$ and hence for the error 
probability to decay exponentially. 

Corollary 3 (Condition for Positive Error Exponent): The 
error probability $\mathbb{P}(\mathcal{A}_n)$ decays exponentially, i.e., $K_p > 0$, 
iff $\Sigma$ has full rank and $T_p$ is not a forest (as was assumed in 
Section II). 

Proof: See Appendix B for the proof. ■ 

The above result provides necessary and sufficient conditions 
for the error exponent $K_p$ to be positive, which implies 
exponential decay of the error probability in $n$, the number of 
samples. Our goal now is to analyze the influence of structure 
and parameters of the Gaussian distribution $p$ on the magnitude 
of the error exponent $K_p$. Such an exercise requires a closed-form 
expression for $K_p$, which in turn, requires a closed-form 
expression for the crossover rate $J_{e,e'}$. However, the crossover 
rate, despite having an exact expression in (10), can only be 
found numerically, since the optimization is non-convex (due 
to the highly nonlinear equality constraint $I(q_e) = I(q_{e'})$). 
Hence, we provide an approximation to the crossover rate in 
the next section which is tight in the so-called very noisy 
learning regime. 

IV. Euclidean Approximations 

In this section, we use an approximation that only considers 
parameters of Gaussian tree models that are "hard" 
for learning. There are three reasons for doing this. Firstly, 
we expect parameters which result in easy problems to have 
large error exponents and so the structures can be learned 
accurately from a moderate number of samples. Hard problems 
thus lend much more insight into when and how errors occur. 
Secondly, it allows us to approximate the intractable problem 
in (10) with an intuitive, closed-form expression. Finally, such 
an approximation allows us to compare the relative ease of 
learning various tree structures in the subsequent sections. 

Our analysis is based on Euclidean information theory [18], 
which we exploit to approximate the crossover rate $J_{e,e'}$ and 
the error exponent $K_p$, defined in (9) and (7) respectively. The 
key idea is to impose suitable "noisy" conditions on $p_{e,e'}$ (the 
joint pdf on node pairs $e$ and $e'$) so as to enable us to relax the 
non-convex optimization problem in (10) to a convex program. 

Definition 1 ($\epsilon$-Very Noisy Condition): The joint pdf $p_{e,e'}$ 
on node pairs $e$ and $e'$ is said to satisfy the $\epsilon$-very noisy 
condition if the correlation coefficients on $e$ and $e'$ satisfy 
$\big| |\rho_e| - |\rho_{e'}| \big| < \epsilon$. 

By continuity of the mutual information in the correlation 
coefficient, given any fixed $\epsilon$ and $\rho_e$, there exists a 
$\delta = \delta(\epsilon, \rho_e) > 0$ such that $|I(p_e) - I(p_{e'})| < \delta$, which means that 
if $\epsilon$ is small, it is difficult to distinguish which node pair $e$ 
or $e'$ has the larger mutual information given the samples $\mathbf{x}^n$. 
Therefore the ordering of the empirical mutual information 
quantities $I(\hat{p}_e)$ and $I(\hat{p}_{e'})$ may be incorrect. Thus, if $\epsilon$ is 
small, we are in the very noisy learning regime, where learning 
is difficult. 

To perform our analysis, we recall from Verdú [23, Sec. IV-E] 
that we can bound the KL-divergence between two zero-mean 
Gaussians with covariance matrices $\Sigma_{e,e'} + \Delta_{e,e'}$ and 
$\Sigma_{e,e'}$ as 

$$D\big(\mathcal{N}(\mathbf{0}, \Sigma_{e,e'} + \Delta_{e,e'}) \,\big\|\, \mathcal{N}(\mathbf{0}, \Sigma_{e,e'})\big) \le \frac{\|\Sigma_{e,e'}^{-1} \Delta_{e,e'}\|_F^2}{4}, \tag{12}$$

where $\|M\|_F$ is the Frobenius norm of the matrix $M$. Furthermore, 
the inequality in (12) is tight when the perturbation 
matrix $\Delta_{e,e'}$ is small. More precisely, as the ratio of the 
singular values $\sigma_{\max}(\Delta_{e,e'}) / \sigma_{\min}(\Sigma_{e,e'})$ tends to zero, the inequality in 
(12) becomes tight. To convexify the problem, we also perform 
a linearization of the nonlinear constraint set in (10) around 
the unperturbed covariance matrix $\Sigma_{e,e'}$. This involves taking 
the derivative of the mutual information with respect to the 
covariance matrix in the Taylor expansion. We denote this 
derivative as $\nabla_{\Sigma_e} I(\Sigma_e)$, where $I(\Sigma_e) = I(\mathcal{N}(\mathbf{0}, \Sigma_e))$ is 
the mutual information between the two random variables of 
the Gaussian joint pdf $p_e = \mathcal{N}(\mathbf{0}, \Sigma_e)$. We now define the 
linearized constraint set of (10) as the affine subspace 

$$L_\Delta(p_{e,e'}) := \big\{ \Delta_{e,e'} \in \mathbb{R}^{m \times m} : I(\Sigma_e) + \langle \nabla_{\Sigma_e} I(\Sigma_e), \Delta_e \rangle = I(\Sigma_{e'}) + \langle \nabla_{\Sigma_{e'}} I(\Sigma_{e'}), \Delta_{e'} \rangle \big\}, \tag{13}$$

where $\Delta_e \in \mathbb{R}^{2 \times 2}$ is the sub-matrix of $\Delta_{e,e'} \in \mathbb{R}^{m \times m}$ ($m = 3$ 
or $4$) that corresponds to the covariance matrix of the node 
pair $e$. We also define the approximate crossover rate of $e$ and 
$e'$ as the minimization of the quadratic in (12) over the affine 
subspace $L_\Delta(p_{e,e'})$ defined in (13): 

$$\widetilde{J}_{e,e'} := \min_{\Delta_{e,e'} \in L_\Delta(p_{e,e'})} \frac{1}{4} \|\Sigma_{e,e'}^{-1} \Delta_{e,e'}\|_F^2. \tag{14}$$

Eqn. (14) is a convexified version of the original optimization 
in (10). This problem is not only much easier to solve, but 
also provides key insights as to when and how errors occur 
when learning the structure. We now define an additional 
information-theoretic quantity before stating the Euclidean 
approximation. 

[Figure: Markov chain with edge labels $\rho_{1,2}$ and $\rho_{2,3}$; the node pair (1,4) is annotated "not dominant".] 

Fig. 2. Illustration of correlation decay in a Markov chain. By Lemma 5(b), 
only the node pairs (1,3) and (2,4) need to be considered for computing 
the error exponent $K_p$. By correlation decay, the node pair (1,4) will not be 
mistaken as a true edge by the estimator because its distance, which is equal 
to 3, is longer than either (1,3) or (2,4), whose distances are equal to 2. 

Definition 2 (Information Density): Given a pairwise joint 
pdf $p_{i,j}$ with marginals $p_i$ and $p_j$, the information density 
denoted by $s_{i,j} : \mathbb{R}^2 \to \mathbb{R}$ is defined as 

$$s_{i,j}(x_i, x_j) := \log \frac{p_{i,j}(x_i, x_j)}{p_i(x_i)\, p_j(x_j)}. \tag{15}$$

Hence, for each pair of variables $x_i$ and $x_j$, its associated information 
density $s_{i,j}$ is a random variable whose expectation 
is the mutual information of $x_i$ and $x_j$, i.e., $\mathbb{E}[s_{i,j}] = I(p_{i,j})$. 

Theorem 4 (Euclidean Approx. of Crossover Rate): 
The approximate crossover rate for the empirical mutual 
information quantities, defined in (14), is given by 

$$\widetilde{J}_{e,e'} = \frac{(\mathbb{E}[s_{e'} - s_e])^2}{2\,\mathrm{Var}(s_{e'} - s_e)} = \frac{(I(p_{e'}) - I(p_e))^2}{2\,\mathrm{Var}(s_{e'} - s_e)}. \tag{16}$$

In addition, the approximate error exponent corresponding to 
$\widetilde{J}_{e,e'}$ in (14) is given by 

$$\widetilde{K}_p = \min_{e' \notin \mathcal{E}_p} \; \min_{e \in \mathrm{Path}(e'; \mathcal{E}_p)} \widetilde{J}_{e,e'}. \tag{17}$$

Proof: The proof involves solving the least squares problem 
in (14). See Appendix C. ■ 

We have obtained a closed-form expression for the approximate 
crossover rate $\widetilde{J}_{e,e'}$ in (16). It is proportional to the square 
of the difference between the mutual information quantities. 
This corresponds to our intuition that if $I(p_e)$ and $I(p_{e'})$ 
are relatively well separated ($I(p_e) \gg I(p_{e'})$) then the rate 
$\widetilde{J}_{e,e'}$ is large. In addition, the SNR is also weighted by the 
inverse variance of the difference of the information densities 
$s_e - s_{e'}$. If the variance is large, then we are uncertain about the 
estimate $I(\hat{p}_e) - I(\hat{p}_{e'})$, thereby reducing the rate. Theorem 4 
illustrates how parameters of Gaussian tree models affect the 
crossover rate. In the sequel, we limit our analysis to the very 
noisy regime where the above expressions apply. 
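
To make (16) concrete, the sketch below evaluates it numerically for a given covariance matrix. It relies on two standard facts that are assumptions relative to the text above: for a zero-mean Gaussian pair the information density (15) is a constant plus a quadratic form in the two variables, and the variance of a quadratic form $\mathbf{x}^T B \mathbf{x}$ with $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \Sigma)$ and symmetric $B$ equals $2\,\mathrm{tr}(B \Sigma B \Sigma)$. The function names and the 0-indexed node labels are illustrative.

```python
import numpy as np

def density_quadratic(sigma_pair):
    """Matrix Q such that s(x) = I + x^T Q x for a zero-mean Gaussian pair."""
    return 0.5 * np.diag(1.0 / np.diag(sigma_pair)) - 0.5 * np.linalg.inv(sigma_pair)

def mutual_information(sigma_pair):
    """Mutual information -0.5 * log(1 - rho^2) of a Gaussian pair."""
    rho_sq = sigma_pair[0, 1] ** 2 / (sigma_pair[0, 0] * sigma_pair[1, 1])
    return -0.5 * np.log(1.0 - rho_sq)

def approx_crossover_rate(sigma, e, e_prime):
    """Evaluate (16) for the node pairs e and e' under the covariance sigma."""
    nodes = sorted(set(e) | set(e_prime))            # m = 3 or 4 distinct nodes
    index = {v: k for k, v in enumerate(nodes)}
    sub = sigma[np.ix_(nodes, nodes)]
    quad = np.zeros_like(sub)                        # quadratic part of s_{e'} - s_e
    info = {}
    for pair, sign in ((e_prime, 1.0), (e, -1.0)):
        idx = [index[v] for v in pair]
        block = sub[np.ix_(idx, idx)]
        quad[np.ix_(idx, idx)] += sign * density_quadratic(block)
        info[pair] = mutual_information(block)
    variance = 2.0 * np.trace(quad @ sub @ quad @ sub)   # Var(s_{e'} - s_e)
    return (info[e_prime] - info[e]) ** 2 / (2.0 * variance)

def chain3(a, b):
    """Unit-variance chain 0 - 1 - 2 with edge correlations a and b."""
    return np.array([[1.0, a, a * b], [a, 1.0, b], [a * b, b, 1.0]])

# Edge e = (0, 1) against the non-edge e' = (0, 2) for two choices of rho_{1,2}.
rate_weak = approx_crossover_rate(chain3(0.5, 0.5), e=(0, 1), e_prime=(0, 2))
rate_strong = approx_crossover_rate(chain3(0.5, 0.8), e=(0, 1), e_prime=(0, 2))
print(rate_weak, rate_strong)
```

In the second call the non-edge correlation (0.4) is closer to the edge correlation (0.5) than in the first call (0.25), so the mutual information quantities are less well separated and the computed rate is smaller.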

V. Simplification of the Error Exponent 

In this section, we exploit the properties of the approximate 
crossover rate in (16) to significantly reduce the complexity 
in finding the error exponent $\widetilde{K}_p$ to $O(d)$. As a motivating 
example, consider the Markov chain in Fig. 2. From our 
analysis to this point, it appears that, when computing the 
approximate error exponent $\widetilde{K}_p$ in (17), we have to consider 
all possible replacements between the non-edges (1,4), (1,3) 
and (2,4) and the true edges along the unique paths connecting 
these non-edges. For example, (1,3) might be mistaken as a 
true edge, replacing either (1,2) or (2,3). 

Fig. 3. Illustration of the properties of $J(\rho_e, \rho_{e'})$ in Lemma 5. $J(\rho_e, \rho_{e'})$ 
is decreasing in $|\rho_{e'}|$ for fixed $\rho_e$ (top) and $J(\rho_{e_1}, \rho_{e_1}\rho_{e_2})$ is increasing 
in $|\rho_{e_1}|$ for fixed $\rho_{e_2}$ if $|\rho_{e_1}| < \rho_{\mathrm{crit}}$ (middle). Similarly, $J(\rho_e, \rho_{e'})$ is 
increasing in $|\rho_e|$ for fixed $\rho_{e'}$ if $|\rho_e| < \rho_{\mathrm{crit}}$ (bottom). 

We will prove that, in fact, to compute $\widetilde{K}_p$ we can ignore 
the possibility that the longest non-edge (1,4) is mistaken as a 
true edge, thus reducing the number of computations for the 
approximate crossover rate $\widetilde{J}_{e,e'}$. The key to this result is 
the exploitation of correlation decay, i.e., the decrease in the 
absolute value of the correlation coefficient between two nodes 
as the distance (the number of edges along the path between 
two nodes) between them increases. This follows from the 
Markov property: 

$$\rho_{e'} = \prod_{e \in \mathrm{Path}(e'; \mathcal{E}_p)} \rho_e, \qquad \forall\, e' \notin \mathcal{E}_p. \tag{18}$$

For example, in Fig. 2, $|\rho_{1,4}| \le \min\{|\rho_{1,3}|, |\rho_{2,4}|\}$ and 
because of this, the following lemma implies that (1,4) is 
less likely to be mistaken as a true edge than (1,3) or (2,4). 
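
The path-product rule (18) is easy to evaluate for every node pair of a tree. The following sketch is illustrative only; the function name `pairwise_correlations` and the edge correlations of the four-node chain are assumptions.

```python
def pairwise_correlations(edge_rho):
    """Correlation of every node pair in a tree via the path product (18).

    edge_rho maps each tree edge (u, v) to its correlation coefficient.
    """
    adjacency = {}
    for (u, v), rho in edge_rho.items():
        adjacency.setdefault(u, []).append((v, rho))
        adjacency.setdefault(v, []).append((u, rho))
    correlations = {}
    for source in adjacency:
        # Walk the tree from `source`, multiplying edge correlations on the way.
        product = {source: 1.0}
        stack = [source]
        while stack:
            node = stack.pop()
            for nxt, rho in adjacency[node]:
                if nxt not in product:
                    product[nxt] = product[node] * rho
                    stack.append(nxt)
        for target, value in product.items():
            if source < target:
                correlations[(source, target)] = value
    return correlations

chain = {(1, 2): 0.8, (2, 3): 0.5, (3, 4): 0.9}
rho = pairwise_correlations(chain)
print(rho[(1, 3)], rho[(2, 4)], rho[(1, 4)])
```

Here the distance-3 pair (1,4) has a smaller correlation magnitude than either distance-2 pair, which is the correlation decay exploited in the lemma that follows.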

It is easy to verify that the crossover rate $\widetilde{J}_{e,e'}$ in (16) 
depends only on the correlation coefficients $\rho_e$ and $\rho_{e'}$ and not 
the variances $\sigma_i^2$. Thus, without loss of generality, we assume 
that all random variables have unit variance (which is still 
unknown to the learner) and to make the dependence clear, we 
now write $\widetilde{J}_{e,e'} = J(\rho_e, \rho_{e'})$. Finally define $\rho_{\mathrm{crit}} := 0.63055$. 

Lemma 5 (Monotonicity of $J(\rho_e, \rho_{e'})$): $J(\rho_e, \rho_{e'})$, derived 
in (16), has the following properties: 

(a) $J(\rho_e, \rho_{e'})$ is an even function of both $\rho_e$ and $\rho_{e'}$. 

(b) $J(\rho_e, \rho_{e'})$ is monotonically decreasing in $|\rho_{e'}|$ for fixed 
$\rho_e \in (-1, 1)$. 

(c) Assuming that $|\rho_{e_1}| < \rho_{\mathrm{crit}}$, then $J(\rho_{e_1}, \rho_{e_1}\rho_{e_2})$ is 
monotonically increasing in $|\rho_{e_1}|$ for fixed $\rho_{e_2}$. 

(d) Assuming that $|\rho_e| < \rho_{\mathrm{crit}}$, then $J(\rho_e, \rho_{e'})$ is monotonically 
increasing in $|\rho_e|$ for fixed $\rho_{e'}$. 

See Fig. 3 for an illustration of the properties of $J(\rho_e, \rho_{e'})$. 

Proof: (Sketch) Statement (a) follows from (16). We prove 
(b) by showing that $\partial J(\rho_e, \rho_{e'}) / \partial |\rho_{e'}| \le 0$ for all $|\rho_{e'}| \le |\rho_e|$. 
Statements (c) and (d) follow similarly. See Appendix D for 
the details. ■ 

Our intuition about correlation decay is substantiated by 
Lemma 5(b), which implies that for the example in Fig. 2, 
$J(\rho_{2,3}, \rho_{1,3}) \le J(\rho_{2,3}, \rho_{1,4})$, since $|\rho_{1,4}| \le |\rho_{1,3}|$ due to the 
Markov property on the chain (18). Therefore, $J(\rho_{2,3}, \rho_{1,4})$ 
can be ignored in the minimization to find $\widetilde{K}_p$ in (17). Interestingly, 
while Lemma 5(b) is a statement about correlation decay, 
Lemma 5(c) states that the absolute strengths of the correlation 
coefficients also influence the magnitude of the crossover rate. 

From Lemma 5(b) (and the above motivating example in 
Fig. 2), finding the approximate error exponent $\widetilde{K}_p$ now 
reduces to finding the minimum crossover rate only over 
triangles ((1,2,3) and (2,3,4)) in the tree as shown in Fig. 2, 
i.e., we only need to consider $J(\rho_e, \rho_{e'})$ for adjacent edges. 

Corollary 6 (Computation of $\widetilde{K}_p$): Under the very noisy 
learning regime, the error exponent $\widetilde{K}_p$ is 

$$\widetilde{K}_p = \min_{e_i, e_j \in \mathcal{E}_p,\; e_i \sim e_j} W(\rho_{e_i}, \rho_{e_j}), \tag{19}$$

where $e_i \sim e_j$ means that the edges $e_i$ and $e_j$ are adjacent 
and the weights are defined as 

$$W(\rho_{e_1}, \rho_{e_2}) := \min\left\{ J(\rho_{e_1}, \rho_{e_1}\rho_{e_2}),\; J(\rho_{e_2}, \rho_{e_1}\rho_{e_2}) \right\}. \tag{20}$$

If we carry out the computations in (19) independently, the 
complexity is $O(d \deg_{\max})$, where $\deg_{\max}$ is the maximum 
degree of the nodes in the tree graph. Hence, in the worst 
case, the complexity is $O(d^2)$, instead of $O(d^3)$ if (17) is 
used. We can, in fact, reduce the number of computations to 
$O(d)$. 

Proposition 7 (Complexity in computing $\widetilde{K}_p$): The approximate 
error exponent $\widetilde{K}_p$, derived in (17), can be computed in 
linear time ($d - 1$ operations) as 

$$\widetilde{K}_p = \min_{e \in \mathcal{E}_p} J(\rho_e, \rho_e \rho_e^*), \tag{21}$$

where the maximum correlation coefficient on the edges 
adjacent to $e \in \mathcal{E}_p$ is defined as 

$$\rho_e^* := \max\{ |\rho_{\tilde{e}}| : \tilde{e} \in \mathcal{E}_p,\; \tilde{e} \sim e \}. \tag{22}$$

Proof: By Lemma 5(b) and the definition of $\rho_e^*$, we obtain 
the smallest crossover rate associated to edge $e$. We obtain the 
approximate error exponent $\widetilde{K}_p$ by minimizing over all edges 
$e \in \mathcal{E}_p$ in (21). ■ 

Recall that $\mathrm{diam}(T_p)$ is the diameter of $T_p$. The computation 
of $\widetilde{K}_p$ is reduced significantly from $O(\mathrm{diam}(T_p)\, d^2)$ in (17) 
to $O(d)$. Thus, there is a further reduction in the complexity 
to estimate the error exponent $\widetilde{K}_p$ as compared to exhaustive 
search which requires $O(d^{d-2})$ computations. This simplification 
only holds for Gaussians under the very noisy regime. 
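
The linear-time computation in (21) and (22) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the rate $J(\rho_e, \rho_e \rho_{\tilde{e}})$ for two adjacent edges is obtained by evaluating (16) numerically on a unit-variance three-node chain (using the standard variance formula $2\,\mathrm{tr}(B \Sigma B \Sigma)$ for Gaussian quadratic forms) rather than through a closed form, and $\rho_e^*$ is found by keeping the two largest $|\rho|$ values at every node. All function names are hypothetical, and every edge correlation is assumed to be non-zero.

```python
import numpy as np

def crossover_rate(rho_e, rho_adj):
    """J(rho_e, rho_e * rho_adj): evaluate (16) on the chain 0 - 1 - 2.

    Edge e = (0, 1) has correlation rho_e, the adjacent edge (1, 2) has
    correlation rho_adj, and the non-edge e' = (0, 2) has their product.
    """
    c = rho_e * rho_adj
    sigma = np.array([[1.0, rho_e, c], [rho_e, 1.0, rho_adj], [c, rho_adj, 1.0]])

    def quadratic(i, j, rho):
        # Quadratic part of the information density of a unit-variance pair.
        q = np.zeros((3, 3))
        q[i, i] = q[j, j] = -rho ** 2
        q[i, j] = q[j, i] = rho
        return q / (2.0 * (1.0 - rho ** 2))

    diff = quadratic(0, 2, c) - quadratic(0, 1, rho_e)        # s_{e'} - s_e
    variance = 2.0 * np.trace(diff @ sigma @ diff @ sigma)
    gap = 0.5 * np.log((1.0 - c ** 2) / (1.0 - rho_e ** 2))   # I(p_e) - I(p_{e'})
    return gap ** 2 / (2.0 * variance)

def approx_error_exponent(edge_rho):
    """Evaluate (21): one pass over the edges to build per-node maxima, one to minimize."""
    top_two = {}                      # node -> two largest |rho| on incident edges
    for (u, v), rho in edge_rho.items():
        for node in (u, v):
            best = sorted(top_two.get(node, [0.0, 0.0]) + [abs(rho)], reverse=True)
            top_two[node] = best[:2]
    exponent = np.inf
    for (u, v), rho in edge_rho.items():
        # rho_star in (22): largest |rho| among the edges adjacent to (u, v).
        candidates = []
        for node in (u, v):
            first, second = top_two[node]
            candidates.append(second if first == abs(rho) else first)
        rho_star = max(candidates)
        if rho_star > 0.0:
            exponent = min(exponent, crossover_rate(rho, rho_star))
    return exponent

tree = {(0, 1): 0.5, (1, 2): 0.4, (1, 3): 0.6, (3, 4): 0.3}
print(approx_error_exponent(tree))
```

Because the per-node maxima are stored, each edge needs only a constant amount of work, so the number of crossover-rate evaluations is $d - 1$, in line with Proposition 7.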

VI. Extremal Structures for Learning 

In this section, we study the influence of graph structure on 
the approximate error exponent $\widetilde{K}_p$ using the concept of correlation 
decay and the properties of the crossover rate $\widetilde{J}_{e,e'}$ in 
Lemma 5. We have already discussed the connection between 
the error exponent and correlation decay. We also proved that 
non-neighbor node pairs which have shorter distances are more 
likely to be mistaken as edges by the ML estimator. Hence, we 
expect a tree $T_p$ which contains non-edges with shorter 
distances to be "harder" to learn (i.e., to have a smaller error 
exponent $\widetilde{K}_p$) as compared to a tree which contains non-edges 
with longer distances. In subsequent subsections, we formalize 
this intuition in terms of the diameter of the tree $\mathrm{diam}(T_p)$, 
and show that the extremal trees, in terms of their diameter, 
are also extremal trees for learning. We also analyze the effect 
of changing the size of the tree on the error exponent. 

From the Markov property in (18), we see that for a 
Gaussian tree distribution, the set of correlation coefficients 
fixed on the edges of the tree, along with the structure $T_p$, are 
sufficient statistics and they completely characterize $p$. Note 
that this parameterization neatly decouples the structure from 
the correlations. We use this fact to study the influence of 
changing the structure $T_p$ while keeping the set of correlations 
on the edges fixed.⁵ Before doing so, we provide a review of 
some basic graph theory. 

A. Basic Notions in Graph Theory 

Definition 3 (Extremal Trees in terms of Diameter):
Assume that d > 3. Define the extremal trees with d nodes in
terms of the tree diameter $\mathrm{diam} : \mathcal{T}^d \to \{2, \ldots, d-1\}$ as

$T_{\max}(d) := \operatorname{argmax}_{T \in \mathcal{T}^d} \mathrm{diam}(T), \qquad T_{\min}(d) := \operatorname{argmin}_{T \in \mathcal{T}^d} \mathrm{diam}(T).$ (23)

Then it is clear that the two extremal structures, the chain
(where there is a simple path passing through all nodes and
edges exactly once) and the star (where there is one central
node), have the largest and smallest diameters respectively, i.e.,

$T_{\max}(d) = T_{\mathrm{chain}}(d)$, and $T_{\min}(d) = T_{\mathrm{star}}(d)$.
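To make the diameter claim concrete, the following Python sketch computes the diameter of a tree with two breadth-first searches and evaluates it on a chain and a star. The edge-list representation and the helper names (`farthest`, `tree_diameter`, `chain_edges`, `star_edges`) are illustrative assumptions and not notation from the paper.

```python
from collections import deque


def farthest(adj, src):
    """Breadth-first search from src; return (a farthest node, its distance)."""
    dist = {src: 0}
    queue = deque([src])
    last = src
    while queue:
        u = queue.popleft()
        last = u                      # BFS pops nodes in nondecreasing distance
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return last, dist[last]


def tree_diameter(edges):
    """Diameter (in edges) of a tree given as a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    end, _ = farthest(adj, next(iter(adj)))   # one end of a longest path
    _, diam = farthest(adj, end)              # distance to the other end
    return diam


def chain_edges(d):
    return [(i, i + 1) for i in range(d - 1)]


def star_edges(d):
    return [(0, i) for i in range(1, d)]


print(tree_diameter(chain_edges(6)), tree_diameter(star_edges(6)))
```

The double search relies on the standard fact that, in a tree, a node farthest from an arbitrary start is an endpoint of some longest path.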

Definition 4 (Line Graph): The line graph [22] $H$ of a
graph $G$, denoted by $H = \mathcal{L}(G)$, is one in which, roughly
speaking, the vertices and edges of $G$ are interchanged. More
precisely, $H$ is the undirected graph whose vertices are the
edges of $G$ and there is an edge between any two vertices
in the line graph if the corresponding edges in $G$ have a
common node, i.e., are adjacent. See Fig. 4 for a graph $G$
and its associated line graph $H$.
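The definition can be sketched in a few lines of Python. Here the vertices of the line graph are the indices of the edges of $G$, and two indices are joined whenever the corresponding edges share a node; the function name `line_graph` and the index-based representation are assumptions made for illustration.

```python
from itertools import combinations


def line_graph(edges):
    """Return the edge set of the line graph of a graph given as a list of
    (u, v) pairs. Each line-graph edge is a frozenset of two edge indices."""
    return {frozenset((a, b))
            for a, b in combinations(range(len(edges)), 2)
            if set(edges[a]) & set(edges[b])}    # edges a and b share a node


star = [(0, 1), (0, 2), (0, 3)]     # 4-node star
chain = [(0, 1), (1, 2), (2, 3)]    # 4-node chain
print(len(line_graph(star)), len(line_graph(chain)))
```

For the star every pair of edges meets at the central node, so its line graph is complete, whereas the line graph of the chain is again a chain.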

Although the set of correlation coefficients on the edges is fixed, the
elements in this set can be arranged in different ways on the edges of the
tree. We formalize this concept in (24).

Fig. 4. (a): A graph $G$. (b): The line graph $H = \mathcal{L}(G)$ that corresponds to
$G$ is the graph whose vertices are the edges of $G$ (denoted as $e_i$) and there
is an edge between any two vertices $i$ and $j$ in $H$ if the corresponding edges
in $G$ share a node.



B. Formulation: Extremal Structures for Learning

We now formulate the problem of finding the best and worst
tree structures for learning and also the distributions associated
with them. At a high level, our strategy involves two distinct
steps. Firstly and primarily, we find the structure of the optimal
distributions in Section VI-D. It turns out that the optimal
structures that maximize and minimize the exponent are the
Markov chain (under some conditions on the correlations) and
the star respectively and these are the extremal structures in
terms of the diameter. Secondly, we optimize over the positions
(or placement) of the correlation coefficients on the edges of
the optimal structures.

Let $\rho := [\rho_1, \rho_2, \ldots, \rho_{d-1}]$ be a fixed vector of feasible
correlation coefficients, i.e., $\rho_i \in (-1,1) \setminus \{0\}$ for all $i$. For
a tree, it follows from (18) that if the $\rho_i$'s are the correlation
coefficients on the edges, then $|\rho_i| < 1$ is a necessary and
sufficient condition to ensure that $\Sigma \succ 0$. Define $\Pi_{d-1}$
to be the group of permutations of order $d-1$, hence
elements in $\Pi_{d-1}$ are permutations of a given ordered set
with cardinality $d-1$. Also denote the set of tree-structured,
$d$-variate Gaussians which have unit variances at all nodes and
$\rho$ as the correlation coefficients on the edges in some order
as $\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)$. Formally,

$\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho) := \{ p(\mathbf{x}) = \mathcal{N}(\mathbf{x}; 0, \Sigma) \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d) : \Sigma(i,i) = 1, \forall i \in V,\ \exists\, \pi_p \in \Pi_{d-1} : \sigma_{\mathcal{E}_p} = \pi_p(\rho) \}$, (24)

where $\sigma_{\mathcal{E}_p} := [\Sigma(i,j) : (i,j) \in \mathcal{E}_p]$ is the length-$(d-1)$ vector
consisting of the covariance elements on the edges (arranged
in lexicographic order) and $\pi_p(\rho)$ is the permutation of $\rho$
according to $\pi_p$. The tuple $(T_p, \pi_p, \rho)$ uniquely parameterizes
a Gaussian tree distribution with unit variances. Note that we
can regard the permutation $\pi_p$ as a nuisance parameter for
solving the optimization for the best structure given $\rho$. Indeed,
it can happen that there are different $\pi_p$'s such that the error
exponent $K_p$ is the same. For instance, in a star graph, all
permutations $\pi_p$ result in the same exponent. Despite this, we
show that extremal tree structures are invariant to the specific
choice of $\pi_p$ and $\rho$.

For distributions in the set $\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)$, our goal is to
find the best (easiest to learn) and the worst (most difficult
to learn) distributions for learning. Formally, the optimization
problems for the best and worst distributions for learning are
given by

$p_{\max,\rho} := \operatorname{argmax}_{p \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)} K_p$, (25)
$p_{\min,\rho} := \operatorname{argmin}_{p \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)} K_p$. (26)

Thus, $p_{\max,\rho}$ (resp. $p_{\min,\rho}$) corresponds to the Gaussian tree
model which has the largest (resp. smallest) approximate error
exponent.

We do not allow any of the correlation coefficients to be zero because
otherwise, this would result in $T_p$ being a forest.

None of the elements in $\Sigma$ are allowed to be zero because $\rho_i \neq 0$ for
every $i \in V$ and the Markov property in (18).



C. Reformulation as Optimization over Line Graphs 

Since the number of permutations $\pi$ and number of spanning
trees are prohibitively large, finding the optimal distributions
cannot be done through a brute-force search unless $d$
is small. Our main idea in this section is to use the notion
of line graphs to simplify the problems in (25) and (26). In
subsequent sections, we identify the extremal tree structures
before identifying the precise best and worst distributions.

Recall that the approximate error exponent $K_p$ can be
expressed in terms of the weights $W(\rho_{e_i}, \rho_{e_j})$ between two
adjacent edges $e_i, e_j$ as in (19). Therefore, we can write the
extremal distribution in (25) as

$p_{\max,\rho} = \operatorname{argmax}_{p \in \mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)}\ \min_{e_i, e_j \in \mathcal{E}_p,\ e_i \sim e_j} W(\rho_{e_i}, \rho_{e_j})$. (27)

Note that in (27), $\mathcal{E}_p$ is the edge set of a weighted graph whose
edge weights are given by $\rho$. Since the weight is between two
edges, it is more convenient to consider line graphs defined in
Section VI-A.

We now transform the intractable optimization problem
in (27) over the set of trees to an optimization problem over
the set of line graphs:

$p_{\max,\rho} = \operatorname{argmax}_{H \in \mathcal{L}(\mathcal{T}^d)}\ \min_{(i,j) \in H} W(\rho_i, \rho_j)$, (28)

and $W(\rho_i, \rho_j)$ can be considered as an edge weight between
nodes $i$ and $j$ in a weighted line graph $H$. Equivalently, (26)
can also be written as in (28) but with the argmax replaced
by an argmin.

D. Main Results: Best and Worst Tree Structures 

In order to solve (28), we need to characterize the set of line
graphs of spanning trees $\mathcal{L}(\mathcal{T}^d) = \{\mathcal{L}(T) : T \in \mathcal{T}^d\}$. This has
been studied before [24, Theorem 8.5], but the set $\mathcal{L}(\mathcal{T}^d)$ is
nonetheless still very complicated. Hence, solving (28) directly
is intractable. Instead, our strategy now is to identify the
structures corresponding to the optimal distributions, $p_{\max,\rho}$
and $p_{\min,\rho}$, by exploiting the monotonicity of $J(\rho_e, \rho_{e'})$ given
in Lemma 5.

Theorem 8 (Extremal Tree Structures): The tree structure
that minimizes the approximate error exponent $K_p$ in (26)
is given by

$T_{p_{\min,\rho}} = T_{\mathrm{star}}(d)$, (29)

for all feasible correlation coefficient vectors $\rho$ with $\rho_i \in
(-1,1) \setminus \{0\}$. In addition, if $\rho_i \in (-\rho_{\mathrm{crit}}, \rho_{\mathrm{crit}}) \setminus \{0\}$ (where
$\rho_{\mathrm{crit}} = 0.63055$), then the tree structure that maximizes the
approximate error exponent $K_p$ in (25) is given by

$T_{p_{\max,\rho}} = T_{\mathrm{chain}}(d)$. (30)

Fig. 5. Illustration for Theorem 8. The star (a) and the chain (b) minimize
and maximize the approximate error exponent respectively.

Proof: (Idea) The assertion that $T_{p_{\min,\rho}} = T_{\mathrm{star}}(d)$
follows from the fact that all the crossover rates for the star
graph are the minimum possible, hence $K_{\mathrm{star}} \le K_p$. See
Appendix E for the details. ■
See Fig. 5. This theorem agrees with our intuition: for the star
graph, the nodes are strongly correlated (since its diameter
is the smallest) while in the chain, there are many weakly
correlated pairs of nodes for the same set of correlation
coefficients on the edges thanks to correlation decay. Hence,
it is hardest to learn the star while it is easiest to learn the
chain. It is interesting to observe that Theorem 8 implies that the
extremal tree structures $T_{p_{\max,\rho}}$ and $T_{p_{\min,\rho}}$ are independent
of the correlation coefficients $\rho$ (if $|\rho_i| < \rho_{\mathrm{crit}}$ in the case
of the chain). Indeed, the experiments in Section VII-B also
suggest that Theorem 8 may likely be true for larger ranges
of problems (without the constraint that $|\rho_i| < \rho_{\mathrm{crit}}$) but this
remains open.

The results in (29) and (30) do not yet provide the complete
solution to $p_{\max,\rho}$ and $p_{\min,\rho}$ in (25) and (26) since there are
many possible pdfs in $\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)$ corresponding to a fixed
tree because we can rearrange the correlation coefficients along
the edges of the tree in multiple ways. The only exception is
if $T_p$ is known to be a star, in which case there is only one pdf in
$\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)$, and we formally state the result below.

Corollary 9 (Most Difficult Distribution to Learn): The
Gaussian $p_{\min,\rho}(\mathbf{x}) = \mathcal{N}(\mathbf{x}; 0, \Sigma_{\min,\rho})$ defined in (26),
corresponding to the most difficult distribution to learn
for fixed $\rho$, has the covariance matrix whose upper
triangular elements are given as $\Sigma_{\min,\rho}(i,j) = \rho_i$ if
$i = 1, j \neq 1$ and $\Sigma_{\min,\rho}(i,j) = \rho_i \rho_j$ otherwise. Moreover,
if $|\rho_1| \ge \cdots \ge |\rho_{d-1}|$ and $|\rho_1| < \rho_{\mathrm{crit}} = 0.63055$,
then $K_p$ corresponding to the star graph can be written
explicitly as a minimization over only two crossover rates:

$K_{p_{\min,\rho}} = \min\{J(\rho_1, \rho_1\rho_2),\ J(\rho_{d-1}, \rho_{d-1}\rho_1)\}$.

Proof: The first assertion follows from the Markov
property (18) and Theorem 8. The next result follows
from Lemma 5(c) which implies that $J(\rho_{d-1}, \rho_{d-1}\rho_1) \le
J(\rho_k, \rho_k\rho_1)$ for all $2 \le k \le d-1$. ■
In other words, $p_{\min,\rho}$ is a star Gaussian graphical model with
correlation coefficients $\rho_i$ on its edges. This result can also
be explained by correlation decay. In a star graph, since the
distances between non-edges are small, the ML estimator is
more likely to mistake a non-edge for a true edge. It is often
useful in applications to compute the minimum error exponent
for a fixed vector of correlations $\rho$ as it provides a lower
bound of the decay rate of $\mathbb{P}(\mathcal{A}_n)$ for any tree distribution
with parameter vector $\rho$. Interestingly, we also have a result
for the easiest tree distribution to learn.

Fig. 6. If $|\rho_{1,2}| < |\rho_{2,3}|$, then the likelihood of the non-edge (1,3)
replacing edge (1,2) would be higher than if $|\rho_{1,2}| = |\rho_{2,3}|$. Hence, the
weight $W(\rho_{1,2}, \rho_{2,3})$ is maximized when equality holds.

Fig. 7. Illustration of Proposition 11. $T_p = (V, \mathcal{E}_p)$ is the original tree
and $e \in \mathcal{E}_p$. $T_{p'} = (V', \mathcal{E}_{p'})$ is a subtree. The observations for learning the
structure $p'$ correspond to the shaded nodes, the unshaded nodes correspond
to unobserved variables.

Corollary 10 (Easiest Distribution to Learn): Assume that
$\rho_{\mathrm{crit}} > |\rho_1| \ge |\rho_2| \ge \cdots \ge |\rho_{d-1}|$. Then, the Gaussian
$p_{\max,\rho}(\mathbf{x}) = \mathcal{N}(\mathbf{x}; 0, \Sigma_{\max,\rho})$ defined in (25), corresponding
to the easiest distribution to learn for fixed $\rho$, has the
covariance matrix whose upper triangular elements are
$\Sigma_{\max,\rho}(i, i+1) = \rho_i$ for all $i = 1, \ldots, d-1$ and
$\Sigma_{\max,\rho}(i,j) = \prod_{k=i}^{j-1} \rho_k$ for all $j > i$.

Proof: The first assertion follows from the proof of
Theorem 8 in Appendix E and the second assertion from the
Markov property in (18). ■
In other words, in the regime where $|\rho_i| < \rho_{\mathrm{crit}}$, $p_{\max,\rho}$ is
a Markov chain Gaussian graphical model with correlation
coefficients arranged in increasing (or decreasing) order on its
edges. We now provide some intuition for why this is so. If
a particular correlation coefficient $\rho_i$ (such that $|\rho_i| < \rho_{\mathrm{crit}}$)
is fixed, then the edge weight $W(\rho_i, \rho_j)$, defined in (20), is
maximized when $|\rho_j| = |\rho_i|$. Otherwise, if $|\rho_i| < |\rho_j|$, the
event that the non-edge with correlation $\rho_i\rho_j$ replaces the edge
with correlation $\rho_i$ (and hence results in an error) has a higher
likelihood than if equality holds. Thus, correlations $\rho_i$ and
$\rho_j$ that are close in terms of their absolute values should be
placed closer to one another (in terms of graph distance) for
the approximate error exponent to be maximized. See Fig. 6.
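The covariance structures appearing in Corollaries 9 and 10 both follow from the Markov property: with unit variances, the covariance between two nodes of a tree is the product of the edge correlations along the path joining them. The Python sketch below builds such a matrix for an arbitrary tree. The function name `tree_covariance`, the zero-based node labels and the dictionary of edge correlations are illustrative assumptions.

```python
def tree_covariance(d, edges):
    """edges: dict {(u, v): rho} on nodes 0..d-1 forming a tree.
    Returns the d x d covariance matrix (unit variances) as nested lists,
    where entry (i, j) is the product of the correlations on the path i -> j."""
    adj = {i: [] for i in range(d)}
    for (u, v), rho in edges.items():
        adj[u].append((v, rho))
        adj[v].append((u, rho))
    cov = [[0.0] * d for _ in range(d)]
    for root in range(d):
        cov[root][root] = 1.0
        stack = [(root, -1)]               # depth-first walk away from root
        while stack:
            u, parent = stack.pop()
            for v, rho in adj[u]:
                if v != parent:
                    cov[root][v] = cov[root][u] * rho
                    stack.append((v, u))
    return cov


# Star with centre 0, and chain 0 - 1 - 2 - 3, using the same correlations.
star_cov = tree_covariance(4, {(0, 1): 0.5, (0, 2): 0.4, (0, 3): 0.3})
chain_cov = tree_covariance(4, {(0, 1): 0.5, (1, 2): 0.4, (2, 3): 0.3})
print(star_cov[1][2], chain_cov[0][3])
```

In the star, every non-edge correlation is a product of two edge correlations, while in the chain the correlation between the two end nodes is the product of all three, which is the correlation decay referred to in the text.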

E. Influence of Data Dimension on Error Exponent 

We now analyze the influence of changing the size of the 
tree on the error exponent, i.e., adding and deleting nodes and 
edges while satisfying the tree constraint and observing sam- 
ples from the modified graphical model. This is of importance 
in many applications. For example, in sequential problems, the 
learner receives data at different times and would like to update 
the estimate of the tree structure learned. In dimensionality 
reduction, the learner is required to estimate the structure 
of a smaller model given high-dimensional data. Intuitively,
learning only a tree with a smaller number of nodes is easier
than learning the entire tree since there are fewer ways for
errors to occur during the learning process. We prove this in
the affirmative in Proposition 11.

Formally, we start with a $d$-variate Gaussian $p \in
\mathcal{P}_{\mathcal{N}}(\mathbb{R}^d, \mathcal{T}^d; \rho)$ and consider a $d'$-variate pdf $p' \in
\mathcal{P}_{\mathcal{N}}(\mathbb{R}^{d'}, \mathcal{T}^{d'}; \rho')$, obtained by marginalizing $p$ over a subset
of variables, and $T_{p'}$ is the tree associated to the distribution
$p'$. Hence $d' < d$ and $\rho'$ is a subvector of $\rho$. See Fig. 7. In our
formulation, the only available observations are those sampled
from the smaller Gaussian graphical model $p'$.

Proposition 11 (Error Exponent of Smaller Trees): The
approximate error exponent for learning $p'$ is at least that of
$p$, i.e., $K_{p'} \ge K_p$.

Proof: Reducing the number of adjacent edges to a fixed
edge $(i,k) \in \mathcal{E}_p$ as in Fig. 7 (where $k \in \mathrm{nbd}(i) \setminus \{j\}$) ensures
that the maximum correlation coefficient $\rho_{i,k}^*$, defined in (22),
does not increase. By Lemma 5(b) and (11), the approximate
error exponent $K_p$ does not decrease. ■
Thus, lower-dimensional models are easier to learn if the set of 
correlation coefficients is fixed and the tree constraint remains 
satisfied. This is a consequence of the fact that there are fewer 
crossover error events that contribute to the error exponent Kp. 

We now consider the "dual" problem of adding a new edge
to an existing tree model, which results in a larger tree. We are
now provided with $(d+1)$-dimensional observations to learn
the larger tree. More precisely, given a $d$-variate tree Gaussian
pdf $p$, we consider a $(d+1)$-variate pdf $p''$ such that $T_p$ is a
subtree of $T_{p''}$. Equivalently, let $\rho := [\rho_{e_1}, \rho_{e_2}, \ldots, \rho_{e_{d-1}}]$ be
the vector of correlation coefficients on the edges of the graph
of $p$ and let $\rho'' := [\rho, \rho_{\mathrm{new}}]$ be that of $p''$.

By comparing the error exponents $K_p$ and $K_{p''}$, we can
address the following question: Given a new edge correlation
coefficient $\rho_{\mathrm{new}}$, how should one adjoin this new edge to
the existing tree such that the resulting error exponent is
maximized or minimized? Evidently, from Proposition 11, it is
not possible to increase the error exponent by growing the tree,
but can we devise a strategy to place this new edge judiciously
(resp. adversarially) so that the error exponent deteriorates as
little (resp. as much) as possible?

To do so, we say edge $e$ contains node $v$ if $e = (v, i)$ and
we define the nodes in the smaller tree $T_p$

$v_{\min}^* := \operatorname{argmin}_{v \in V} \max\{|\rho_e| : e \text{ contains node } v\}$, (31)
$v_{\max}^* := \operatorname{argmax}_{v \in V} \max\{|\rho_e| : e \text{ contains node } v\}$. (32)

Proposition 12 (Error Exponent of Larger Trees): Assume
that $|\rho_{\mathrm{new}}| < |\rho_e|\ \forall e \in \mathcal{E}_p$. Then,

(a) The difference between the error exponents $K_p - K_{p''}$
is minimized when $T_{p''}$ is obtained by adding to $T_p$ a
new edge with correlation coefficient $\rho_{\mathrm{new}}$ at vertex $v_{\min}^*$
given by (31) as a leaf.

(b) The difference $K_p - K_{p''}$ is maximized when the new
edge is added to $v_{\max}^*$ given by (32) as a leaf.

Proof: The vertex given by (31) is the best vertex to attach
the new edge by Lemma 5(b). ■
This result implies that if we receive data dimensions sequentially,
we have a straightforward rule in (31) for identifying
larger trees such that the exponent decreases as little as
possible at each step.

Note that $T_{p'}$ still needs to satisfy the tree constraint so that the variables
that are marginalized out are not arbitrary (but must be variables that form the
first part of a node elimination order [3]). For example, we are not allowed
to marginalize out the central node of a star graph since the resulting graph
would not be a tree. However, we can marginalize out any of the other nodes.
In effect, we can only marginalize out nodes with degree either 1 or 2.

[Plot: True Rate and Approx Rate versus $I(\rho_e) - I(\rho_{e'})$.]

Fig. 8. Comparison of true and approximate crossover rates in (10) and (16)
respectively.
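The selection rules (31) and (32) are simple to evaluate: for each node, record the largest correlation magnitude on its incident edges, then pick the node where this value is smallest (resp. largest). The following Python sketch does this for a tree stored as a dictionary of edge correlations; the function name `attachment_vertices` and the data layout are assumptions made for illustration.

```python
def attachment_vertices(edges):
    """edges: dict {(u, v): rho} describing the current tree.
    Returns (v_min, v_max) in the sense of (31) and (32)."""
    strongest = {}                     # node -> max |rho| over incident edges
    for (u, v), rho in edges.items():
        for node in (u, v):
            strongest[node] = max(strongest.get(node, 0.0), abs(rho))
    v_min = min(strongest, key=strongest.get)
    v_max = max(strongest, key=strongest.get)
    return v_min, v_max


# Chain 0 - 1 - 2 - 3 with decreasing correlation magnitudes.
tree = {(0, 1): 0.8, (1, 2): -0.6, (2, 3): 0.4}
print(attachment_vertices(tree))
```

In this example node 3 touches only the weakest edge, so attaching a new leaf there is the choice suggested by Proposition 12(a); ties, such as nodes 0 and 1 for the maximum, are broken arbitrarily by the sketch.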

VII. Numerical Experiments 

We now perform experiments with the following two objectives.
Firstly, we study the accuracy of the Euclidean
approximations (Theorem 4) to identify regimes in which the
approximate crossover rate is close to the true crossover
rate $J_{e,e'}$. Secondly, by performing simulations we study how
various tree structures (e.g., chains and stars) influence the error
exponents (Theorem 8).



A. Comparison Between True and Approximate Rates 

In Fig. 8, we plot the true and approximate crossover rates
(given in (10) and (14) respectively) for a 4-node symmetric
star graph, whose structure is shown in Fig. 9. The zero-mean
Gaussian graphical model has a covariance matrix $\Sigma$ such that
$\Sigma^{-1}$ is parameterized by $\gamma \in (0, 1/\sqrt{3})$ in the following way:
$\Sigma^{-1}(i,i) = 1$ for all $i$, $\Sigma^{-1}(1,j) = \Sigma^{-1}(j,1) = \gamma$ for all
$j = 2,3,4$ and $\Sigma^{-1}(i,j) = 0$ otherwise. By increasing $\gamma$, we
increase the difference of the mutual information quantities
on the edges $e$ and non-edges $e'$. We see from Fig. 8 that
both rates increase as the difference $I(\rho_e) - I(\rho_{e'})$ increases.
This is in line with our intuition because if $p_{e,e'}$ is such that
$I(\rho_e) - I(\rho_{e'})$ is large, the crossover rate is also large. We also
observe that if $I(\rho_e) - I(\rho_{e'})$ is small, the true and approximate
rates are close. This is also in line with the assumptions of
Theorem 4. When the difference between the mutual information
quantities increases, the true and approximate rates separate
from each other.

This small example has sufficient illustrative power because as we have
seen, errors occur locally and only involve triangles.

Fig. 9. Left: The symmetric star graphical model used for comparing the
true and approximate crossover rates as described in Section VII-A. Right:
The structure of a hybrid tree graph with d = 10 nodes as described in
Section VII-B. This is a tree with a length-d/2 chain and an order-d/2 star
attached to one of the leaf nodes of the chain.



B. Comparison of Error Exponents Between Trees 

In Fig. 10, we simulate error probabilities by drawing i.i.d.
samples from three d = 10 node tree graphs: a chain, a star
and a hybrid between a chain and a star as shown in Fig. 9.
We then used the samples to learn the structure via the
Chow-Liu procedure [5] by solving the MWST problem. The d - 1 = 9
correlation coefficients were chosen to be equally spaced in
the interval [0.1, 0.9] and they were randomly placed on the
edges of the three tree graphs. We observe from Fig. 10 that
for fixed $n$, the star and chain have the highest and lowest
error probabilities $\mathbb{P}(\mathcal{A}_n)$ respectively. The simulated error
exponents given by $\{-n^{-1} \log \mathbb{P}(\mathcal{A}_n)\}_{n \in \mathbb{N}}$ also converge to
their true values as $n \to \infty$. The exponent associated to the
chain is higher than that of the star, which is corroborated by
Theorem 8 even though the theorem only applies in the very
noisy case (and for $|\rho_i| < 0.63055$ in the case of the chain).
From this experiment, the claim also seems to be true even
though the setup is not very noisy. We also observe that the
error exponent of the hybrid is between that of the star and
the chain.
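A minimal Python sketch of this type of experiment is given below. It is not the code used to produce Fig. 10. It assumes zero-mean, unit-variance Gaussian trees sampled from the root outwards, and it runs the maximum-weight spanning tree step on the magnitudes of the empirical correlations, which is equivalent to using empirical mutual informations because the Gaussian mutual information is monotonic in the squared correlation. All function names and the parent-array encoding of the tree are illustrative assumptions.

```python
import random


def sample_tree(d, parent, rho, rng):
    """One sample from a unit-variance Gaussian tree. Node 0 is the root,
    parent[i] < i is the parent of node i, rho[i] the correlation on that edge."""
    x = [rng.gauss(0.0, 1.0)] + [0.0] * (d - 1)
    for i in range(1, d):
        noise = rng.gauss(0.0, 1.0)
        x[i] = rho[i] * x[parent[i]] + (1 - rho[i] ** 2) ** 0.5 * noise
    return x


def chow_liu(samples):
    """Max-weight spanning tree on |empirical correlation| (Kruskal)."""
    d = len(samples[0])

    def corr(i, j):
        sij = sum(s[i] * s[j] for s in samples)
        sii = sum(s[i] ** 2 for s in samples)
        sjj = sum(s[j] ** 2 for s in samples)
        return abs(sij) / (sii * sjj) ** 0.5

    pairs = sorted(((corr(i, j), i, j)
                    for i in range(d) for j in range(i + 1, d)), reverse=True)
    root = list(range(d))              # union-find forest

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a

    tree = set()
    for _, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:                   # adding (i, j) creates no cycle
            root[ri] = rj
            tree.add((i, j))
    return tree


def error_rate(parent, rho, n, trials, seed=0):
    """Fraction of trials in which the learned tree differs from the truth."""
    rng = random.Random(seed)
    d = len(parent)
    truth = {(parent[i], i) for i in range(1, d)}
    errors = sum(
        chow_liu([sample_tree(d, parent, rho, rng) for _ in range(n)]) != truth
        for _ in range(trials))
    return errors / trials


rho = [0.0, 0.5, 0.5, 0.5, 0.5]                     # rho[0] is unused
print(error_rate([0, 0, 1, 2, 3], rho, n=50, trials=200),   # chain
      error_rate([0, 0, 0, 0, 0], rho, n=50, trials=200))   # star
```

Given an estimated error probability for sample size $n$, the simulated exponent is obtained as $-n^{-1}$ times its logarithm, as in the text; many more trials than shown here are needed before such an estimate becomes stable.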



VIII. Conclusion 

Using the theory of large deviations, we have obtained 
the error exponent associated with learning the structure of 
a Gaussian tree model. Our analysis in this theoretical paper 
also answers the fundamental questions as to which set of 
parameters and which structures result in high and low error 
exponents. We conclude that Markov chains (resp. stars) are 
the easiest (resp. hardest) structures to learn as they maximize 
(resp. minimize) the error exponent. Indeed, our numerical 
experiments on a variety of Gaussian graphical models validate 
the theory presented. We believe the intuitive results presented 
in this paper will lend useful insights for modeling
high-dimensional data using tree distributions.

Fig. 10. Simulated error probabilities and error exponents for chain, hybrid
and star graphs with fixed $\rho$. The dashed lines show the true error exponent
$K_p$ computed numerically using (10) and (11). Observe that the simulated
error exponent converges to the true error exponent as $n \to \infty$. The legend
applies to both plots.



Appendix A 
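Proof of Theorem 1

Proof: This proof borrows ideas from [25]. We assume
$m = 4$ (i.e., disjoint edges) for simplicity. The result for $m = 3$
follows similarly. Let $V' \subset V$ be a set of $m = 4$ nodes
corresponding to node pairs $e$ and $e'$. Given a subset of node
pairs $\mathcal{Y} \subset V' \times V'$ such that $(i,i) \in \mathcal{Y}, \forall i \in V'$, the set of
feasible moments [4] is defined as

$\mathcal{M}_{\mathcal{Y}} := \{ \eta_{e,e'} \in \mathbb{R}^{|\mathcal{Y}|} : \exists\, q(\cdot) \in \mathcal{P}(\mathbb{R}^{|V'|}) \text{ s.t. } \mathbb{E}_q[x_i x_j] = \eta_{i,j},\ \forall (i,j) \in \mathcal{Y} \}$. (33)

Let the set of densities with moments $\eta_{e,e'} := \{\eta_{i,j} : (i,j) \in \mathcal{Y}\}$
be denoted as

$\mathcal{B}_{\mathcal{Y}}(\eta_{e,e'}) := \{ q \in \mathcal{P}(\mathbb{R}^{|V'|}) : \mathbb{E}_q[x_i x_j] = \eta_{i,j},\ (i,j) \in \mathcal{Y} \}$. (34)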



Lemma 13 (Sanov's Thm, Contraction Principle [20]):
For the event that the empirical moments of the i.i.d.
observations $\mathbf{x}^n$ are equal to $\eta_{e,e'} = \{\eta_{i,j} : (i,j) \in \mathcal{Y}\}$, we
have the LDP

lim 




(35) 



If $\eta_{e,e'} \in \mathcal{M}_{\mathcal{Y}}$, the optimizing pdf $q_{e,e'}^*$ in (35) is given by
$q_{e,e'}^*(\mathbf{x}) \propto p_{e,e'}(\mathbf{x}) \exp\big[\sum_{(i,j) \in \mathcal{Y}} \theta_{i,j} x_i x_j\big]$, where the set of
constants $\{\theta_{i,j} : (i,j) \in \mathcal{Y}\}$ are chosen such that $q_{e,e'}^* \in
\mathcal{B}_{\mathcal{Y}}(\eta_{e,e'})$ given in (34).

Fig. 11. Illustration for the proof of Corollary 3. The correlation coefficient
on the non-edge is $\rho_{e'}$ and satisfies $|\rho_{e'}| = |\rho_e|$ if $|\rho_{\tilde{e}}| = 1$.

Fig. 12. Illustration for the proof of Corollary 3. The unique path between
$i_0$ and $i_M$ is $(i_0, i_1, \ldots, i_M) = \mathrm{Path}(e'; \mathcal{E}_p)$.

From Lemma 13, we conclude that the optimal $q_{e,e'}^*$ in (35)
is a Gaussian. Thus, we can restrict our search for the optimal
distribution to a search over Gaussians, which are parameterized
by means and covariances. The crossover event for mutual
information is $\mathcal{C}_{e,e'} = \{\hat{\rho}_{e'}^2 \ge \hat{\rho}_e^2\}$, since in the
Gaussian case, the mutual information is a monotonic function
of the square of the correlation coefficient (cf. Eqn. (5)).
Thus it suffices to consider $\{\hat{\rho}_{e'}^2 \ge \hat{\rho}_e^2\}$, instead of the event
involving the mutual information quantities. Let $e = (i,j)$,
$e' = (k,l)$ and $\eta_{e,e'} := (\eta_e, \eta_{e'}, \eta_i, \eta_j, \eta_k, \eta_l) \in \mathcal{M}_{\mathcal{Y}} \subset \mathbb{R}^6$ be
the moments of $p_{e,e'}$, where $\eta_e := \mathbb{E}[x_i x_j]$ is the covariance
of $x_i$ and $x_j$, and $\eta_i := \mathbb{E}[x_i^2]$ is the variance of $x_i$ (and
similarly for the other moments). Now apply the contraction
principle [21, Ch. 3] to the continuous map $h : \mathcal{M}_{\mathcal{Y}} \to \mathbb{R}$,
given by the difference between the square of correlation
coefficients

$h(\eta_{e,e'}) := \dfrac{\eta_e^2}{\eta_i \eta_j} - \dfrac{\eta_{e'}^2}{\eta_k \eta_l}$. (36)

Following the same argument as in [1, Theorem 2], the equality
case dominates $\mathcal{C}_{e,e'}$, i.e., the event $\{\hat{\rho}_{e'}^2 = \hat{\rho}_e^2\}$ dominates
$\{\hat{\rho}_{e'}^2 \ge \hat{\rho}_e^2\}$. Thus, by considering the set $\{\eta_{e,e'} : h(\eta_{e,e'}) = 0\}$,
the rate corresponding to $\mathcal{C}_{e,e'}$ can be written as

$J_{e,e'} = \inf_{\eta_{e,e'} \in \mathcal{M}_{\mathcal{Y}}} \Big\{ g(\eta_{e,e'}) : \dfrac{\eta_e^2}{\eta_i \eta_j} = \dfrac{\eta_{e'}^2}{\eta_k \eta_l} \Big\}$, (37)

where the function $g : \mathcal{M}_{\mathcal{Y}} \subset \mathbb{R}^6 \to [0, \infty)$ is defined as

$g(\eta_{e,e'}) := \inf_{q_{e,e'} \in \mathcal{B}_{\mathcal{Y}}(\eta_{e,e'})} D(q_{e,e'} \,\|\, p_{e,e'})$, (38)

and the set $\mathcal{B}_{\mathcal{Y}}(\eta_{e,e'})$ is defined in (34). Combining expressions
in (37) and (38) and the fact that the optimal solution
$q_{e,e'}^*$ is Gaussian yields $J_{e,e'}$ as given in the statement of the
theorem (cf. Eqn. (10)).

The second assertion in the theorem follows from the fact
that since $p_{e,e'}$ satisfies $I(\rho_e) \neq I(\rho_{e'})$, we have $|\rho_e| \neq |\rho_{e'}|$
since $I(\rho_e)$ is a monotonic function in $|\rho_e|$. Therefore, $q_{e,e'}^* \neq
p_{e,e'}$ on a set whose (Lebesgue) measure $\nu$ is strictly positive.
Since $D(q_{e,e'}^* \,\|\, p_{e,e'}) = 0$ if and only if $q_{e,e'}^* = p_{e,e'}$ almost
everywhere-$[\nu]$, this implies that $D(q_{e,e'}^* \,\|\, p_{e,e'}) > 0$ [19,
Theorem 8.6.1]. ■

Appendix B 
Proof of Corollary 3

Proof: ($\Rightarrow$) Assume that $K_p > 0$. Suppose, to the
contrary, that either (i) $T_p$ is a forest or (ii) $\mathrm{rank}(\Sigma) < d$
and $T_p$ is not a forest. In (i), structure estimation of $p$ will
be inconsistent (as described in Section III-B), which implies
that $K_p = 0$, a contradiction. In (ii), since $p$ is a spanning
tree, there exists an edge $\tilde{e} \in \mathcal{E}_p$ such that the correlation
coefficient $\rho_{\tilde{e}} = \pm 1$ (otherwise $\Sigma$ would be full rank). In this
case, referring to Fig. 11 and assuming that $|\rho_e| \in (0,1)$, the
correlation on the non-edge $e'$ satisfies $|\rho_{e'}| = |\rho_e||\rho_{\tilde{e}}| = |\rho_e|$,
which implies that $I(\rho_e) = I(\rho_{e'})$. Thus, there is no unique
maximizer in (4) with the empiricals $\hat{\rho}_e$ replaced by $\rho_e$. As a
result, ML for structure learning via (4) is inconsistent, hence
$K_p = 0$, a contradiction.

This is also intuitively true because the most likely way the error event
$\mathcal{C}_{e,e'}$ occurs is when equality holds, i.e., $\{\hat{\rho}_{e'}^2 = \hat{\rho}_e^2\}$.

($\Leftarrow$) Suppose both $\Sigma \succ 0$ and $T_p$ is not a proper forest,
i.e., $T_p$ is a spanning tree. Assume, to the contrary, that
$K_p = 0$. Then from (11), $I(\rho_e) = I(\rho_{e'})$ for some $e' \notin \mathcal{E}_p$
and some $e \in \mathrm{Path}(e'; \mathcal{E}_p)$. This implies that $|\rho_e| = |\rho_{e'}|$.
Let $e' = (i_0, i_M)$ be a non-edge and let the unique path from
node $i_0$ to node $i_M$ be $(i_0, i_1, \ldots, i_M)$ for some $M \ge 2$. See
Fig. 12. Then, $|\rho_{e'}| = |\rho_{i_0,i_M}| = |\rho_{i_0,i_1}||\rho_{i_1,i_2}| \cdots |\rho_{i_{M-1},i_M}|$.
Suppose, without loss of generality, that edge $e = (i_0, i_1)$ is
such that $|\rho_{e'}| = |\rho_e|$ holds, then we can cancel $|\rho_{e'}|$ and
$|\rho_{i_0,i_1}|$ on both sides to give $|\rho_{i_1,i_2}| \cdots |\rho_{i_{M-1},i_M}| = 1$.
Cancelling $\rho_{e'}$ is legitimate because we assumed that $\rho_{e'} \neq 0$
for all $e' \in V \times V$, because $p$ is a spanning tree. Since
each correlation coefficient has magnitude not exceeding 1,
this means that each correlation coefficient has magnitude 1,
i.e., $|\rho_{i_1,i_2}| = \cdots = |\rho_{i_{M-1},i_M}| = 1$. Since the correlation
coefficients equal $\pm 1$, the submatrix of the covariance matrix
$\Sigma$ containing these correlation coefficients is not positive
definite. Therefore by Sylvester's condition, the covariance
matrix $\Sigma \nsucc 0$, a contradiction. Hence, $K_p > 0$. ■

Appendix C 
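Proof of Theorem 4

Proof: We first assume that $e$ and $e'$ do not share a
node. The approximation of the KL-divergence for Gaussians
can be written as in (12). We now linearize the constraint
set $L_\Delta(p_{e,e'})$ as defined in (13). Given a positive definite
covariance matrix $\Sigma_e \in \mathbb{R}^{2 \times 2}$, to simplify the notation,
let $I(\Sigma_e) = I(\mathcal{N}(\mathbf{x}; 0, \Sigma_e))$ be the mutual information of
the two random variables with covariance matrix $\Sigma_e$. We
now perform a first-order Taylor expansion of the mutual
information around $\Sigma_e$. This can be expressed as

$I(\Sigma_e + \Delta_e) = I(\Sigma_e) + \mathrm{Tr}\big(\nabla_{\Sigma_e} I(\Sigma_e)^T \Delta_e\big) + o(\Delta_e)$. (39)

Recall that the Taylor expansion of log-det [26] is
$\log\det(\mathbf{A}) = \log\det(\mathbf{B}) + \langle \mathbf{A} - \mathbf{B}, \mathbf{B}^{-1} \rangle + o(\|\mathbf{A} - \mathbf{B}\|_F)$,
with the notation $\langle \mathbf{A} - \mathbf{B}, \mathbf{B}^{-1} \rangle = \mathrm{Tr}((\mathbf{A} - \mathbf{B})\mathbf{B}^{-1})$. Using
this result we can conclude that the gradient of $I$ with respect
to $\Sigma_e$ in the above expansion (39) can be simplified to give
the matrix

$\nabla_{\Sigma_e} I(\Sigma_e) = -\dfrac{1}{2} \begin{pmatrix} 0 & [\Sigma_e^{-1}]_{\mathrm{od}} \\ [\Sigma_e^{-1}]_{\mathrm{od}} & 0 \end{pmatrix}$, (40)

where $[\mathbf{A}]_{\mathrm{od}}$ is the (unique) off-diagonal element of the $2 \times 2$
symmetric matrix $\mathbf{A}$. By applying the same expansion to
$I(\Sigma_{e'} + \Delta_{e'})$, we can express the linearized constraint as

$\langle \mathbf{M}, \Delta \rangle = \mathrm{Tr}(\mathbf{M}^T \Delta) = I(\Sigma_e) - I(\Sigma_{e'})$, (41)

where the symmetric matrix $\mathbf{M} = \mathbf{M}(\Sigma_{e,e'})$ is defined in
the following fashion: $\mathbf{M}(i,j) = \frac{1}{2}[\Sigma_e^{-1}]_{\mathrm{od}}$ if $(i,j) = e$,
$\mathbf{M}(i,j) = -\frac{1}{2}[\Sigma_{e'}^{-1}]_{\mathrm{od}}$ if $(i,j) = e'$ and $\mathbf{M}(i,j) = 0$
otherwise.

Thus, the problem reduces to minimizing (over $\Delta$) the
approximate objective in (12) subject to the linearized
constraints in (41). This is a least-squares problem. By using
the matrix derivative identities $\nabla_\Delta \mathrm{Tr}(\mathbf{M}\Delta) = \mathbf{M}$ and
$\nabla_\Delta \mathrm{Tr}((\Sigma^{-1}\Delta)^2) = 2\Sigma^{-1}\Delta\Sigma^{-1}$, we can solve for the
optimizer $\Delta^*$ yielding:

$\Delta^* = \dfrac{I(\Sigma_e) - I(\Sigma_{e'})}{\mathrm{Tr}((\mathbf{M}\Sigma)^2)}\, \Sigma \mathbf{M} \Sigma$. (42)

Substituting the expression for $\Delta^*$ into (12) yields

$J_{e,e'} = \dfrac{(I(\Sigma_e) - I(\Sigma_{e'}))^2}{4\,\mathrm{Tr}((\mathbf{M}\Sigma)^2)}$. (43)

Comparing (43) to the desired result in the statement of the
theorem, we observe that the problem now reduces to showing that
$\mathrm{Tr}((\mathbf{M}\Sigma)^2) = \frac{1}{2}\mathrm{Var}(s_e - s_{e'})$. To this end, we note that for Gaussians,
the information density is $s_e(x_i, x_j) = -\frac{1}{2}\log(1 - \rho_e^2) -
[\Sigma_e^{-1}]_{\mathrm{od}}\, x_i x_j$. Since the first term is a constant, it suffices to
compute $\mathrm{Var}([\Sigma_e^{-1}]_{\mathrm{od}}\, x_i x_j - [\Sigma_{e'}^{-1}]_{\mathrm{od}}\, x_k x_l)$. Now, we define
the matrices

$\mathbf{C} := \begin{pmatrix} 0 & 1/2 \\ 1/2 & 0 \end{pmatrix}, \quad \mathbf{C}_1 := \begin{pmatrix} \mathbf{C} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix}, \quad \mathbf{C}_2 := \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{C} \end{pmatrix}$,

and use the following identity for the normal random vector
$(x_i, x_j, x_k, x_l) \sim \mathcal{N}(0, \Sigma)$:

$\mathrm{Cov}(a x_i x_j, b x_k x_l) = 2ab \cdot \mathrm{Tr}(\mathbf{C}_1 \Sigma \mathbf{C}_2 \Sigma), \quad \forall a, b \in \mathbb{R}$,

and the definition of $\mathbf{M}$ to conclude that $\mathrm{Var}(s_e - s_{e'}) =
2\mathrm{Tr}((\mathbf{M}\Sigma)^2)$. This completes the proof for the case when $e$
and $e'$ do not share a node. The proof for the case when $e$ and
$e'$ share a node proceeds along exactly the same lines with a
slight modification of the matrix $\mathbf{M}$. ■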

Appendix D 
Proof of Lemma 5

Proof: Denoting the correlation coefficient on edge $e$
and non-edge $e'$ as $\rho_e$ and $\rho_{e'}$ respectively, the approximate
crossover rate can be expressed as

$J(\rho_e, \rho_{e'}) = \dfrac{A(\rho_e^2, \rho_{e'}^2)}{B(\rho_e^2, \rho_{e'}^2)}$, (44)

where the numerator and the denominator are defined as

$A(\rho_e^2, \rho_{e'}^2) := \left[ \dfrac{1}{2} \log\left( \dfrac{1 - \rho_{e'}^2}{1 - \rho_e^2} \right) \right]^2$,

$B(\rho_e^2, \rho_{e'}^2) := \dfrac{2(\rho_{e'}^4 + \rho_{e'}^2)}{(1 - \rho_{e'}^2)^2} + \dfrac{2(\rho_e^4 + \rho_e^2)}{(1 - \rho_e^2)^2} - \dfrac{4\rho_{e'}^2(\rho_e^2 + 1)}{(1 - \rho_{e'}^2)(1 - \rho_e^2)}$.

The evenness result follows from $A$ and $B$ because $J(\rho_e, \rho_{e'})$
is, in fact, a function of $(\rho_e^2, \rho_{e'}^2)$. To simplify the notation, we
make the following substitutions: $x := \rho_{e'}^2$ and $y := \rho_e^2$. Now
we apply the quotient rule to (44). Defining $\mathcal{R} := \{(x,y) \in
\mathbb{R}^2 : y \in (0,1), x \in (0,y)\}$, it suffices to show that

$C(x,y) := B(x,y) \dfrac{\partial A(x,y)}{\partial x} - A(x,y) \dfrac{\partial B(x,y)}{\partial x} \le 0$,

for all $(x,y) \in \mathcal{R}$. Upon simplification, we have

$C(x,y) = \dfrac{\log\left(\frac{1-x}{1-y}\right) \left[ \log\left(\frac{1-x}{1-y}\right) C_1(x,y) + C_2(x,y) \right]}{2(1-y)^2(1-x)^3}$,

where $C_1(x,y) := y^2 x - 6xy - 1 - 2y + 3y^2$ and $C_2(x,y) :=
2x^2 y - 6x^2 + 2x - 2y^2 x + 8xy - 2y - 2y^2$. Since $x < y$, the
logs in $C(x,y)$ are positive, i.e., $\log\left(\frac{1-x}{1-y}\right) > 0$, so it suffices
to show that

$\log\left(\dfrac{1-x}{1-y}\right) C_1(x,y) + C_2(x,y) \le 0$,

for all $(x,y) \in \mathcal{R}$. By using the inequality $\log(1+t) \le t$ for
all $t > -1$, it again suffices to show that

$C_3(x,y) := (y - x) C_1(x,y) + (1 - y) C_2(x,y) \le 0$.

Now upon simplification, $C_3(x,y) = 3y^3 x - 19y^2 x - 3y -
2y^2 + 5y^3 - 3y^2 x^2 + 14x^2 y + 3x + 8xy - 6x^2$, and this
polynomial is equal to zero in $\bar{\mathcal{R}}$ (the closure of $\mathcal{R}$) iff $x = y$.
At all other points in $\bar{\mathcal{R}}$, $C_3(x,y) < 0$. Thus, the derivative
of $J(\rho_e, \rho_{e'})$ with respect to $\rho_{e'}^2$ is indeed strictly negative on
$\mathcal{R}$. Keeping $\rho_e$ fixed, the function $J(\rho_e, \rho_{e'})$ is monotonically
decreasing in $\rho_{e'}^2$ and hence $|\rho_{e'}|$. Statements (c) and (d) follow
along exactly the same lines and are omitted for brevity. ■
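The monotonicity and evenness properties can be spot-checked numerically. The Python sketch below evaluates (44) on a grid of non-edge correlations for a fixed edge correlation. It is a numerical illustration under the closed form of $A$ and $B$ stated above, not a substitute for the proof, and the function name `crossover_rate` and the chosen grid are assumptions.

```python
import math


def crossover_rate(rho_e, rho_ne):
    """Approximate crossover rate J(rho_e, rho_e') = A / B from (44)."""
    x, y = rho_ne ** 2, rho_e ** 2          # x = rho_{e'}^2, y = rho_e^2
    a = (0.5 * math.log((1 - x) / (1 - y))) ** 2
    b = (2 * (x * x + x) / (1 - x) ** 2
         + 2 * (y * y + y) / (1 - y) ** 2
         - 4 * x * (y + 1) / ((1 - x) * (1 - y)))
    return a / b


rho_e = 0.5
grid = [k / 20 for k in range(1, 10)]       # |rho_e'| = 0.05, ..., 0.45
rates = [crossover_rate(rho_e, t) for t in grid]

# Lemma 5(b): for fixed rho_e the rate should decrease as |rho_e'| grows.
decreasing = all(r1 > r2 for r1, r2 in zip(rates, rates[1:]))
# Evenness: the rate depends on the correlations only through their squares.
even = crossover_rate(-0.5, 0.25) == crossover_rate(0.5, -0.25)
print(decreasing, even)
```

The grid stays strictly inside the region $|\rho_{e'}| < |\rho_e|$, which corresponds to the set $\mathcal{R}$ used in the proof.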

Appendix E 
Proofs of Theorem 8 and Corollary 10

Proof: Proof of $T_{p_{\min,\rho}} = T_{\mathrm{star}}(d)$: Sort the correlation
coefficients in decreasing order of magnitude and relabel
the edges such that $|\rho_{e_1}| \ge \cdots \ge |\rho_{e_{d-1}}|$. Then, from
Lemma 5(b), the set of crossover rates for the star graph is
given by $\{J(\rho_{e_1}, \rho_{e_1}\rho_{e_2})\} \cup \{J(\rho_{e_i}, \rho_{e_i}\rho_{e_1}) : i = 2, \ldots, d-1\}$.
For edge $e_1$, the correlation coefficient $\rho_{e_2}$ is the largest
correlation coefficient (and hence results in the smallest rate).
For all other edges $\{e_i : i \ge 2\}$, the correlation coefficient
$\rho_{e_1}$ is the largest possible correlation coefficient (and hence
results in the smallest rate). Since each member in the set of
crossovers is the minimum possible, the minimum of these
crossover rates is also the minimum possible among all tree
graphs. ■

Before we prove part (b), we present some properties of the
edge weights $W(\rho_i, \rho_j)$, defined in (20).

Lemma 14 (Properties of Edge Weights): Assume that all 
the correlation coefficients are bounded above by $\rho_{\mathrm{crit}}$, i.e., 
$|\rho_i| \le \rho_{\mathrm{crit}}$. Then $W(\rho_i, \rho_j)$ satisfies the following properties: 

(a) The weights are symmetric, i.e., $W(\rho_i, \rho_j) = W(\rho_j, \rho_i)$. 

(b) $W(\rho_i, \rho_j) = J(\min\{|\rho_i|, |\rho_j|\}, \rho_i\rho_j)$, where $J$ is the 
approximate crossover rate given in (44). 

(c) If $|\rho_i| \ge |\rho_j| \ge |\rho_k|$, then 

$$W(\rho_i, \rho_k) \le \min\{W(\rho_i, \rho_j), W(\rho_j, \rho_k)\}.$$ (45) 

(d) If $|\rho_1| \ge \ldots \ge |\rho_{d-1}|$, then 

$$W(\rho_i, \rho_j) \le W(\rho_i, \rho_{i+1}), \quad \forall\, j \ge i+1,$$ (46a) 
$$W(\rho_i, \rho_j) \le W(\rho_i, \rho_{i-1}), \quad \forall\, j \le i-1.$$ (46b) 

Fig. 13. Illustration of the proof of Theorem 8. Let $|\rho_1| \ge \ldots \ge |\rho_{d-1}|$. 
The figure shows the chain $H^*_{\mathrm{chain}}$ (in the line graph domain) where the 
correlation coefficients $\{\rho_i\}$ are placed in decreasing order. 

Proof: Claim (a) follows directly from the definition of $J$ 
in (20). Claim (b) also follows from the definition of $J$ and its 
monotonicity property in Lemma 5(d). Claim (c) follows by 
first using Claim (b) to establish that the RHS of (45) equals 
$\min\{J(\rho_j, \rho_j\rho_i), J(\rho_k, \rho_k\rho_j)\}$ since $|\rho_i| \ge |\rho_j| \ge |\rho_k|$. By 
the same argument, the LHS of (45) equals $J(\rho_k, \rho_k\rho_i)$. Now 
we have 

$$J(\rho_k, \rho_k\rho_i) \le J(\rho_j, \rho_j\rho_i), \qquad J(\rho_k, \rho_k\rho_i) \le J(\rho_k, \rho_k\rho_j),$$ (47) 

where the first and second inequalities follow from 
Lemmas 5(c) and 5(b) respectively. This establishes (45). Claim 
(d) follows by applying Claim (c) recursively. ■ 
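The step from Claim (c) to Claim (d) can be spelled out as follows. This is a sketch of one way to chain (45); the index triples are chosen here for illustration and are not taken from the original proof. The cases $j = i+1$ and $j = i-1$ hold with equality, so only $j \ge i+2$ and $j \le i-2$ need an argument.

```latex
\begin{align*}
% (46a): for j >= i+2, apply (45) to the triple (i, i+1, j),
% which is admissible because |rho_i| >= |rho_{i+1}| >= |rho_j|.
W(\rho_i, \rho_j)
  &\le \min\{ W(\rho_i, \rho_{i+1}),\, W(\rho_{i+1}, \rho_j) \}
   \le W(\rho_i, \rho_{i+1}), \\
% (46b): for j <= i-2, apply (45) to the triple (j, i-1, i),
% then use the symmetry in Claim (a).
W(\rho_j, \rho_i)
  &\le \min\{ W(\rho_j, \rho_{i-1}),\, W(\rho_{i-1}, \rho_i) \}
   \le W(\rho_{i-1}, \rho_i) = W(\rho_i, \rho_{i-1}).
\end{align*}
```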



Proof of $T_{p_{\max}}(\boldsymbol{\rho}) = T_{\mathrm{chain}}(d)$: Assume, without 
loss of generality, that $|\rho_{e_1}| \ge \ldots \ge |\rho_{e_{d-1}}|$, and we also 
abbreviate $\rho_{e_i}$ as $\rho_i$ for all $i = 1, \ldots, d-1$. We use the idea 
of line graphs introduced in Section VI-A and Lemma 14. 
Recall that $\mathcal{L}(\mathcal{T}^d)$ is the set of line graphs of spanning trees 
with $d$ nodes. From (28), the line graph for the structure of 
the best distribution $p_{\max,\boldsymbol{\rho}}$ for learning is 

$$H_{\max,\boldsymbol{\rho}} := \mathop{\mathrm{argmax}}_{H \in \mathcal{L}(\mathcal{T}^d)} \; \min_{(i,j) \in H} W(\rho_i, \rho_j).$$ (48) 

We now argue that the length $d-1$ chain $H^*_{\mathrm{chain}}$ (in the line 
graph domain) with correlation coefficients $\{\rho_i\}_{i=1}^{d-1}$ arranged 
in decreasing order on the nodes (see Fig. 13) is the line graph 
that optimizes (48). Note that the edge weights of $H^*_{\mathrm{chain}}$ are 
given by $W(\rho_i, \rho_{i+1})$ for $1 \le i \le d-2$. Consider any other 
line graph $H \in \mathcal{L}(\mathcal{T}^d)$. Then we claim that 

$$\min_{(i,j) \in H \setminus H^*_{\mathrm{chain}}} W(\rho_i, \rho_j) \le \min_{(i,j) \in H^*_{\mathrm{chain}} \setminus H} W(\rho_i, \rho_j).$$ (49) 

To prove (49), note that any edge $(i, j) \in H^*_{\mathrm{chain}} \setminus H$ is consecutive, 
i.e., of the form $(i, i+1)$. Fix any such $(i, i+1)$. Define 
the two subchains of $H^*_{\mathrm{chain}}$ as $H_i := \{(1, 2), \ldots, (i-1, i)\}$ 
and $H_{i+1} := \{(i+1, i+2), \ldots, (d-2, d-1)\}$ (see Fig. 13). 
Also, let $V(H_i) := \{1, \ldots, i\}$ and $V(H_{i+1}) := \{i+1, \ldots, d-1\}$ 
be the nodes in subchains $H_i$ and $H_{i+1}$ respectively. 
Because $(i, i+1) \notin H$, there is a set of edges (called cut 
set edges) $S_i := \{(j, k) \in H : j \in V(H_i), k \in V(H_{i+1})\}$ to 
ensure that the line graph $H$ remains connected.† The edge 
weight of each cut set edge $(j, k) \in S_i$ satisfies $W(\rho_j, \rho_k) \le 
W(\rho_i, \rho_{i+1})$ by (46) because $|j - k| \ge 2$ and $j \le i$ and 
$k \ge i+1$. By considering all cut set edges $(j, k) \in S_i$ for fixed 
$i$ and subsequently all $(i, i+1) \in H^*_{\mathrm{chain}} \setminus H$, we establish (49). 
It follows that 

$$\min_{(i,j) \in H} W(\rho_i, \rho_j) \le \min_{(i,j) \in H^*_{\mathrm{chain}}} W(\rho_i, \rho_j),$$ (50) 

because the remaining edges of $H$ and $H^*_{\mathrm{chain}}$, i.e., those not appearing in (49), are common to both. 
See Fig. 14 for an example to illustrate (49). 

† The line graph $H = \mathcal{L}(G)$ of a connected graph $G$ is connected. In 
addition, any $H \in \mathcal{L}(\mathcal{T}^d)$ must be a claw-free, block graph [24, Theorem 
8.5]. 

Fig. 14. A 7-node tree $T$ and its line graph $H = \mathcal{L}(T)$ are shown 
in the left and right figures respectively. In this case $H \setminus H^*_{\mathrm{chain}} = 
\{(1,4), (2,5), (4,6), (3,6)\}$ and $H^*_{\mathrm{chain}} \setminus H = \{(1,2), (2,3), (3,4)\}$. 
Eqn. (49) holds because from (46), $W(\rho_1, \rho_4) \le W(\rho_1, \rho_2)$, $W(\rho_2, \rho_5) \le 
W(\rho_2, \rho_3)$ etc., and also if $a_i \le b_i$ for $i \in \mathcal{I}$ (for finite $\mathcal{I}$), then 
$\min_{i \in \mathcal{I}} a_i \le \min_{i \in \mathcal{I}} b_i$. 
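The bottleneck comparison in (49) and (50) can be sanity-checked numerically. The sketch below makes two assumptions that should be read as such. First, `toy_weight` is a hypothetical stand-in that satisfies the symmetry and the inequality (45) of Lemma 14; it is not the paper's edge weight $W$ or crossover rate $J$. Second, the 7-node tree is one tree whose line graph reproduces the edge sets stated in the caption of Fig. 14. All function names are illustrative.

```python
from itertools import combinations, permutations, product


def toy_weight(a, b):
    """Stand-in edge weight (NOT the paper's W): symmetric and satisfies (45)."""
    return -abs(abs(a) - abs(b))


def line_graph(tree_edges):
    """Pairs (a, b), a < b, of tree-edge indices that share an endpoint."""
    return {(a, b) for a, b in combinations(range(len(tree_edges)), 2)
            if set(tree_edges[a]) & set(tree_edges[b])}


def bottleneck(tree_edges, rhos):
    """Smallest weight over line-graph edges; rhos[k] sits on tree_edges[k]."""
    return min(toy_weight(rhos[a], rhos[b]) for a, b in line_graph(tree_edges))


def tree_from_pruefer(seq, d):
    """Decode a Pruefer sequence into the edge list of a labeled tree on d nodes."""
    degree = [1] * d
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        leaf = min(u for u in range(d) if degree[u] == 1)
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    # Exactly two nodes of degree one remain; they form the last edge.
    edges.append(tuple(u for u in range(d) if degree[u] == 1))
    return edges


def best_bottleneck(rhos):
    """Brute force over all labeled trees and all placements of rhos on edges."""
    d = len(rhos) + 1
    return max(bottleneck(tree_from_pruefer(seq, d), perm)
               for seq in product(range(d), repeat=d - 2)
               for perm in permutations(rhos))


def sorted_chain_bottleneck(rhos):
    """Bottleneck of the chain with |rho| placed in decreasing order."""
    s = sorted(rhos, key=abs, reverse=True)
    return min(toy_weight(s[i], s[i + 1]) for i in range(len(s) - 1))


# A tree consistent with Fig. 14: node 0 is the center, and the k-th listed
# edge carries rho_{k+1} (so the list is ordered rho_1, ..., rho_6).
fig14_tree = [(1, 4), (2, 5), (3, 6), (0, 1), (0, 2), (0, 3)]
H = {(a + 1, b + 1) for a, b in line_graph(fig14_tree)}   # 1-based labels
H_chain = {(i, i + 1) for i in range(1, 6)}
print(sorted(H - H_chain))   # [(1, 4), (2, 5), (3, 6), (4, 6)]
print(sorted(H_chain - H))   # [(1, 2), (2, 3), (3, 4)]

# Brute-force check for d = 5 with illustrative correlation values.
rhos = [0.6, -0.55, 0.4, 0.1]
print(best_bottleneck(rhos), sorted_chain_bottleneck(rhos))
```

Under the stand-in weight, the exhaustive search over all 125 labeled trees on five nodes and all 24 placements returns the same bottleneck value as the sorted chain, which is what the cut-set argument predicts for any weight satisfying (46). It does not verify the result for the actual crossover rate $J$.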

Since the chain line graph $H^*_{\mathrm{chain}}$ achieves the maximum 
bottleneck edge weight, it is the optimal line graph, i.e., 
$H_{\max,\boldsymbol{\rho}} = H^*_{\mathrm{chain}}$. Furthermore, since the line graph of a 
chain is a chain, the best structure $T_{p_{\max}}(\boldsymbol{\rho})$ is also a chain, 
and we have established part (b). The best distribution is given 
by the chain with the correlations placed in decreasing order, 
establishing Corollary 10. ■ 